The outbreak of the novel Corona virus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 112 million cases have been confirmed worldwide, with nearly 2.5 million associated deaths. Within the US alone, there have been over 500,000 deaths and upwards of 28 million cases reported. Governments around the world have implemented and suggested a number of policies to lessen the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The global pandemic has impacted the lives of individuals in countless ways, and though many countries have begun vaccinating individuals, the long-term impact of the virus remains unclear.
The impact of COVID-19 on a given segment of the population appears to vary drastically based on the socioeconomic characteristics of the segment. In particular, differing rates of infection and fatalities have been reported among different racial groups, age groups, and socioeconomic groups. One of the most important metrics for determining the impact of the pandemic is the death rate, which is the proportion of people within the total population that die due to the the disease.
We assemble this dataset for our research with the goal to investigate the effectiveness of lockdown on flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.
There are two main goals for this case study.
Remark1: The data and the statistics reported here were collected before February of 2021.
Remark 2: A group of RAs spent tremendous amount of time working together to assemble the data. It requires data wrangling skills.
Remark 3: Please keep track with the most updated version of this write-up.
The data comes from several different sources:
In this case study, we use the following three nearly cleaned data:
Among all data, the unique identifier of county is FIPS.
The cleaning procedure is attached in Appendix 2: Data cleaning You may go through it if you are interested or would like to make any changes.
It may need more data wrangling.
First read in the data.
# county-level socialeconomic information
county_data <- fread("data/covid_county.csv")
# county-level COVID case and death
covid_rate <- fread("data/covid_rates.csv")
# county-level lockdown dates
# covid_intervention <- fread("data/covid_intervention.csv")The detailed description of variables is in Appendix 1: Data description. Please get familiar with the variables. Summarize the two data briefly.
# explore county data
# head(county_data)
# dim(county_data)
# str(county_data)
# summary(county_data)
skim(county_data)| Name | county_data |
| Number of rows | 3279 |
| Number of columns | 208 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 206 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| State | 0 | 1 | 2 | 2 | 0 | 53 | 0 |
| County | 0 | 1 | 3 | 31 | 0 | 1947 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FIPS | 0 | 1.00 | 31341.51 | 1.63e+04 | 0.00e+00 | 19020.00 | 30019.00 | 46101.50 | 7.22e+04 | ▅▇▇▇▁ |
| MedHHInc | 86 | 0.97 | 52944.90 | 1.39e+04 | 2.54e+04 | 43783.00 | 50748.00 | 59020.00 | 1.40e+05 | ▆▇▁▁▁ |
| PerCapitaInc | 6 | 1.00 | 26719.97 | 6.95e+03 | 5.97e+03 | 22553.00 | 26169.00 | 30106.00 | 7.28e+04 | ▁▇▂▁▁ |
| PovertyUnder18Pct | 86 | 0.97 | 21.05 | 8.86e+00 | 2.50e+00 | 14.50 | 20.10 | 26.20 | 6.83e+01 | ▅▇▂▁▁ |
| PovertyAllAgesPct | 86 | 0.97 | 15.13 | 6.09e+00 | 2.60e+00 | 10.80 | 14.10 | 18.30 | 5.40e+01 | ▆▇▂▁▁ |
| Deep_Pov_All | 6 | 1.00 | 7.15 | 4.56e+00 | 0.00e+00 | 4.50 | 6.16 | 8.22 | 4.40e+01 | ▇▂▁▁▁ |
| Deep_Pov_Children | 7 | 1.00 | 10.24 | 7.54e+00 | 0.00e+00 | 5.44 | 8.69 | 12.56 | 6.92e+01 | ▇▂▁▁▁ |
| PovertyUnder18Num | 86 | 0.97 | 12211.90 | 2.36e+05 | 5.00e+00 | 503.00 | 1200.00 | 2966.00 | 1.30e+07 | ▇▁▁▁▁ |
| PovertyAllAgesNum | 86 | 0.97 | 39322.57 | 7.58e+05 | 5.00e+00 | 1621.00 | 3876.00 | 9874.00 | 4.19e+07 | ▇▁▁▁▁ |
| PopChangeRate1819 | 84 | 0.97 | 0.10 | 1.28e+00 | -1.55e+01 | -0.55 | 0.04 | 0.69 | 1.42e+01 | ▁▁▇▁▁ |
| PopChangeRate1019 | 6 | 1.00 | 0.70 | 8.86e+00 | -3.35e+01 | -4.42 | -0.62 | 4.44 | 1.34e+02 | ▇▆▁▁▁ |
| TotalPopEst2019 | 6 | 1.00 | 301847.27 | 5.87e+06 | 8.60e+01 | 11317.00 | 26687.00 | 71686.00 | 3.28e+08 | ▇▁▁▁▁ |
| NetMigrationRate1019 | 85 | 0.97 | 0.01 | 7.58e+00 | -3.22e+01 | -4.07 | -1.15 | 3.17 | 1.16e+02 | ▅▇▁▁▁ |
| NaturalChangeRate1019 | 85 | 0.97 | 1.03 | 3.66e+00 | -1.10e+01 | -1.36 | 0.59 | 3.01 | 2.31e+01 | ▁▇▃▁▁ |
| Net_International_Migration_Rate_2010_2019 | 85 | 0.97 | 0.92 | 1.60e+00 | -1.25e+00 | 0.10 | 0.40 | 1.10 | 2.13e+01 | ▇▁▁▁▁ |
| PopChangeRate0010 | 12 | 1.00 | 5.20 | 1.30e+01 | -4.66e+01 | -2.52 | 3.18 | 10.16 | 1.10e+02 | ▁▇▁▁▁ |
| NetMigrationRate0010 | 95 | 0.97 | 1.00 | 1.11e+01 | -5.74e+01 | -5.12 | -0.35 | 5.35 | 8.52e+01 | ▁▇▇▁▁ |
| NaturalChangeRate0010 | 95 | 0.97 | 2.84 | 4.71e+00 | -1.45e+01 | -0.26 | 2.36 | 5.27 | 2.64e+01 | ▁▇▇▁▁ |
| Immigration_Rate_2000_2010 | 95 | 0.97 | 1.25 | 1.82e+00 | -5.19e+00 | 0.20 | 0.58 | 1.56 | 1.77e+01 | ▁▇▁▁▁ |
| PopDensity2010 | 6 | 1.00 | 284.68 | 1.72e+03 | 4.00e-02 | 17.79 | 47.05 | 131.29 | 6.95e+04 | ▇▁▁▁▁ |
| Under18Pct2010 | 6 | 1.00 | 23.46 | 3.34e+00 | 0.00e+00 | 21.49 | 23.43 | 25.13 | 4.16e+01 | ▁▁▇▃▁ |
| Age65AndOlderPct2010 | 6 | 1.00 | 15.79 | 4.15e+00 | 3.47e+00 | 13.06 | 15.41 | 18.08 | 4.34e+01 | ▂▇▂▁▁ |
| WhiteNonHispanicPct2010 | 6 | 1.00 | 76.30 | 2.29e+01 | 2.00e-01 | 65.34 | 84.99 | 94.00 | 9.92e+01 | ▁▁▂▃▇ |
| BlackNonHispanicPct2010 | 6 | 1.00 | 8.57 | 1.43e+01 | 0.00e+00 | 0.38 | 1.79 | 9.53 | 8.54e+01 | ▇▁▁▁▁ |
| AsianNonHispanicPct2010 | 6 | 1.00 | 1.15 | 2.54e+00 | 0.00e+00 | 0.26 | 0.46 | 1.00 | 4.30e+01 | ▇▁▁▁▁ |
| NativeAmericanNonHispanicPct2010 | 6 | 1.00 | 1.82 | 7.47e+00 | 0.00e+00 | 0.19 | 0.30 | 0.62 | 9.50e+01 | ▇▁▁▁▁ |
| HispanicPct2010 | 6 | 1.00 | 10.51 | 1.90e+01 | 0.00e+00 | 1.63 | 3.46 | 9.16 | 9.97e+01 | ▇▁▁▁▁ |
| MultipleRacePct2010 | 6 | 1.00 | 2.02 | 1.62e+00 | 1.00e-01 | 1.16 | 1.64 | 2.40 | 2.95e+01 | ▇▁▁▁▁ |
| NonHispanicWhitePopChangeRate0010 | 12 | 1.00 | 0.54 | 1.38e+01 | -9.43e+01 | -6.52 | -0.68 | 5.82 | 1.49e+02 | ▁▇▅▁▁ |
| NonHispanicBlackPopChangeRate0010 | 73 | 0.98 | 69.62 | 4.85e+02 | -1.00e+02 | -3.30 | 13.59 | 61.29 | 2.20e+04 | ▇▁▁▁▁ |
| NonHispanicAsianPopChangeRate0010 | 51 | 0.98 | 74.23 | 1.29e+02 | -1.00e+02 | 17.80 | 47.31 | 91.55 | 2.90e+03 | ▇▁▁▁▁ |
| NonHispanicNativeAmericanPopChangeRate0010 | 35 | 0.99 | 17.66 | 6.64e+01 | -1.00e+02 | -8.83 | 9.07 | 30.22 | 1.50e+03 | ▇▁▁▁▁ |
| HispanicPopChangeRate0010 | 12 | 1.00 | 81.91 | 9.10e+01 | -1.00e+02 | 32.50 | 66.41 | 106.35 | 1.74e+03 | ▇▁▁▁▁ |
| MultipleRacePopChangeRate0010 | 14 | 1.00 | 59.24 | 5.72e+01 | -8.75e+01 | 24.83 | 53.51 | 83.58 | 8.50e+02 | ▇▂▁▁▁ |
| WhiteNonHispanicNum2010 | 6 | 1.00 | 180407.91 | 3.50e+06 | 2.40e+01 | 7946.00 | 19996.00 | 55254.00 | 1.97e+08 | ▇▁▁▁▁ |
| BlackNonHispanicNum2010 | 6 | 1.00 | 34543.53 | 6.75e+05 | 0.00e+00 | 62.00 | 664.00 | 5528.00 | 3.77e+07 | ▇▁▁▁▁ |
| AsianNonHispanicNum2010 | 6 | 1.00 | 13259.48 | 2.71e+05 | 0.00e+00 | 30.00 | 115.00 | 665.00 | 1.45e+07 | ▇▁▁▁▁ |
| NativeAmericanNonHispanicNum2010 | 6 | 1.00 | 2059.77 | 4.05e+04 | 0.00e+00 | 34.00 | 100.00 | 401.00 | 2.25e+06 | ▇▁▁▁▁ |
| HispanicNum2010 | 6 | 1.00 | 47406.95 | 9.44e+05 | 0.00e+00 | 275.00 | 1001.00 | 5159.00 | 5.05e+07 | ▇▁▁▁▁ |
| MultipleRaceNum2010 | 6 | 1.00 | 8295.29 | 1.63e+05 | 1.00e+00 | 157.00 | 432.00 | 1514.00 | 9.01e+06 | ▇▁▁▁▁ |
| ForeignBornPct | 85 | 0.97 | 4.80 | 5.74e+00 | 0.00e+00 | 1.37 | 2.78 | 5.88 | 5.32e+01 | ▇▁▁▁▁ |
| ForeignBornEuropePct | 85 | 0.97 | 0.56 | 7.10e-01 | 0.00e+00 | 0.16 | 0.35 | 0.69 | 7.71e+00 | ▇▁▁▁▁ |
| ForeignBornMexPct | 85 | 0.97 | 1.93 | 3.54e+00 | 0.00e+00 | 0.17 | 0.65 | 1.99 | 3.95e+01 | ▇▁▁▁▁ |
| NonEnglishHHPct | 6 | 1.00 | 3.57 | 1.12e+01 | 0.00e+00 | 0.32 | 0.91 | 2.28 | 8.95e+01 | ▇▁▁▁▁ |
| Ed1LessThanHSPct | 6 | 1.00 | 13.70 | 6.65e+00 | 1.18e+00 | 8.81 | 12.22 | 17.59 | 6.63e+01 | ▇▅▁▁▁ |
| Ed2HSDiplomaOnlyPct | 6 | 1.00 | 34.09 | 7.17e+00 | 5.47e+00 | 29.55 | 34.26 | 39.07 | 5.56e+01 | ▁▂▇▇▁ |
| Ed3SomeCollegePct | 6 | 1.00 | 21.56 | 4.14e+00 | 4.12e+00 | 18.98 | 21.58 | 24.02 | 3.87e+01 | ▁▂▇▂▁ |
| Ed4AssocDegreePct | 6 | 1.00 | 8.93 | 2.64e+00 | 1.12e+00 | 7.16 | 8.67 | 10.53 | 2.14e+01 | ▁▇▆▁▁ |
| Ed5CollegePlusPct | 6 | 1.00 | 21.72 | 9.40e+00 | 0.00e+00 | 15.19 | 19.47 | 26.01 | 7.85e+01 | ▃▇▂▁▁ |
| AvgHHSize | 6 | 1.00 | 2.53 | 2.80e-01 | 1.34e+00 | 2.36 | 2.49 | 2.65 | 4.97e+00 | ▁▇▁▁▁ |
| FemaleHHPct | 6 | 1.00 | 11.38 | 4.64e+00 | 0.00e+00 | 8.37 | 10.59 | 13.38 | 4.08e+01 | ▃▇▁▁▁ |
| HH65PlusAlonePct | 6 | 1.00 | 12.56 | 3.09e+00 | 2.78e+00 | 10.64 | 12.39 | 14.32 | 3.18e+01 | ▁▇▃▁▁ |
| OwnHomePct | 6 | 1.00 | 71.31 | 8.22e+00 | 4.26e+00 | 67.36 | 72.42 | 76.91 | 9.24e+01 | ▁▁▁▇▅ |
| ForeignBornNum | 84 | 0.97 | 40883.53 | 8.12e+05 | 0.00e+00 | 189.00 | 672.00 | 3189.50 | 4.35e+07 | ▇▁▁▁▁ |
| TotalPopACS | 6 | 1.00 | 297005.32 | 5.77e+06 | 7.50e+01 | 11411.00 | 26719.00 | 71359.00 | 3.23e+08 | ▇▁▁▁▁ |
| ForeignBornAfricaPct | 84 | 0.97 | 2.77 | 1.45e+02 | 0.00e+00 | 0.00 | 0.05 | 0.19 | 8.18e+03 | ▇▁▁▁▁ |
| Ed3SomeCollegeNum | 6 | 1.00 | 41363.08 | 8.05e+05 | 1.20e+01 | 1650.00 | 3753.00 | 10484.00 | 4.50e+07 | ▇▁▁▁▁ |
| Ed2HSDiplomaOnlyNum | 6 | 1.00 | 54531.49 | 1.06e+06 | 1.50e+01 | 2910.00 | 6578.00 | 15874.00 | 5.93e+07 | ▇▁▁▁▁ |
| Ed1LessThanHSNum | 7 | 1.00 | 24884.60 | 4.86e+05 | 4.00e+00 | 1041.75 | 2652.00 | 6275.75 | 2.69e+07 | ▇▁▁▁▁ |
| TotalPop25Plus | 6 | 1.00 | 200959.07 | 3.90e+06 | 6.90e+01 | 7952.00 | 18362.00 | 47876.00 | 2.18e+08 | ▇▁▁▁▁ |
| Ed5CollegePlusNum | 6 | 1.00 | 63309.28 | 1.23e+06 | 0.00e+00 | 1262.00 | 3323.00 | 11159.00 | 6.89e+07 | ▇▁▁▁▁ |
| TotalOccHU | 7 | 1.00 | 110145.31 | 2.14e+06 | 3.30e+01 | 4381.25 | 10201.50 | 27118.50 | 1.20e+08 | ▇▁▁▁▁ |
| ForeignBornAsiaPct | 84 | 0.97 | 1.66 | 3.17e+01 | 0.00e+00 | 0.24 | 0.51 | 1.12 | 1.79e+03 | ▇▁▁▁▁ |
| Ed4AssocDegreeNum | 7 | 1.00 | 16885.70 | 3.27e+05 | 7.00e+00 | 648.00 | 1593.00 | 4591.00 | 1.83e+07 | ▇▁▁▁▁ |
| ForeignBornEuropeNum | 84 | 0.97 | 4487.14 | 8.78e+04 | 0.00e+00 | 21.00 | 83.00 | 388.00 | 4.78e+06 | ▇▁▁▁▁ |
| NonEnglishHHNum | 6 | 1.00 | 5133.96 | 9.82e+04 | 0.00e+00 | 19.00 | 89.00 | 476.00 | 5.32e+06 | ▇▁▁▁▁ |
| HH65PlusAloneNum | 6 | 1.00 | 11843.41 | 2.29e+05 | 4.00e+00 | 595.00 | 1266.00 | 3152.00 | 1.29e+07 | ▇▁▁▁▁ |
| OwnHomeNum | 6 | 1.00 | 70322.53 | 1.36e+06 | 2.00e+00 | 3241.00 | 7314.00 | 18964.00 | 7.64e+07 | ▇▁▁▁▁ |
| FemaleHHNum | 6 | 1.00 | 13891.26 | 2.69e+05 | 0.00e+00 | 457.00 | 1189.00 | 3235.00 | 1.51e+07 | ▇▁▁▁▁ |
| TotalHH | 7 | 1.00 | 110145.31 | 2.14e+06 | 3.30e+01 | 4381.25 | 10201.50 | 27118.50 | 1.20e+08 | ▇▁▁▁▁ |
| ForeignBornCentralSouthAmPct | 85 | 0.97 | 2.50 | 3.94e+00 | 0.00e+00 | 0.35 | 1.06 | 2.79 | 3.95e+01 | ▇▁▁▁▁ |
| ForeignBornCentralSouthAmNum | 85 | 0.97 | 16795.83 | 3.40e+05 | 0.00e+00 | 60.00 | 280.00 | 1499.50 | 1.79e+07 | ▇▁▁▁▁ |
| ForeignBornCaribPct | 85 | 0.97 | 0.24 | 1.03e+00 | 0.00e+00 | 0.00 | 0.04 | 0.15 | 3.20e+01 | ▇▁▁▁▁ |
| ForeignBornCaribNum | 85 | 0.97 | 3992.83 | 8.59e+04 | 0.00e+00 | 0.00 | 10.00 | 88.00 | 4.25e+06 | ▇▁▁▁▁ |
| ForeignBornAfricaNum | 85 | 0.97 | 2020.73 | 3.92e+04 | 0.00e+00 | 0.00 | 13.00 | 109.00 | 2.15e+06 | ▇▁▁▁▁ |
| ForeignBornAsiaNum | 85 | 0.97 | 12584.71 | 2.53e+05 | 0.00e+00 | 29.00 | 122.00 | 677.25 | 1.34e+07 | ▇▁▁▁▁ |
| ForeignBornMexNum | 85 | 0.97 | 10702.37 | 2.22e+05 | 0.00e+00 | 32.25 | 186.50 | 960.25 | 1.14e+07 | ▇▁▁▁▁ |
| LandAreaSQMiles2010 | 6 | 1.00 | 3240.24 | 6.33e+04 | 2.00e+00 | 421.36 | 610.41 | 932.66 | 3.53e+06 | ▇▁▁▁▁ |
| Age65AndOlderNum2010 | 6 | 1.00 | 37087.28 | 7.19e+05 | 1.20e+01 | 1915.00 | 4112.00 | 10092.00 | 4.03e+07 | ▇▁▁▁▁ |
| TotalPop2010 | 6 | 1.00 | 284129.62 | 5.52e+06 | 6.82e+01 | 11549.00 | 26890.00 | 70217.00 | 3.09e+08 | ▇▁▁▁▁ |
| Under18Num2010 | 6 | 1.00 | 68272.70 | 1.33e+06 | 0.00e+00 | 2699.00 | 6359.00 | 16380.00 | 7.42e+07 | ▇▁▁▁▁ |
| Net_International_Migration_2000_2010 | 95 | 0.97 | 5972.94 | 5.15e+04 | -4.97e+03 | 24.50 | 136.75 | 728.12 | 1.91e+06 | ▇▁▁▁▁ |
| NaturalChangeNum0010 | 95 | 0.97 | 10719.20 | 8.28e+04 | -2.73e+04 | -27.75 | 490.75 | 2546.38 | 3.10e+06 | ▇▁▁▁▁ |
| NetMigrationNum0010 | 95 | 0.97 | 5974.78 | 7.02e+04 | -8.39e+05 | -758.50 | -62.50 | 1945.50 | 2.11e+06 | ▁▇▁▁▁ |
| TotalPopEst2012 | 6 | 1.00 | 288777.18 | 5.61e+06 | 8.60e+01 | 11499.00 | 26901.00 | 70666.00 | 3.14e+08 | ▇▁▁▁▁ |
| TotalPopEst2013 | 6 | 1.00 | 290746.63 | 5.65e+06 | 8.90e+01 | 11455.00 | 26886.00 | 70906.00 | 3.16e+08 | ▇▁▁▁▁ |
| TotalPopEst2010 | 6 | 1.00 | 284671.05 | 5.53e+06 | 8.40e+01 | 11556.00 | 26854.00 | 70333.00 | 3.09e+08 | ▇▁▁▁▁ |
| TotalPopEst2014 | 6 | 1.00 | 292843.40 | 5.69e+06 | 8.90e+01 | 11504.00 | 26944.00 | 70993.00 | 3.18e+08 | ▇▁▁▁▁ |
| TotalPopEst2011 | 6 | 1.00 | 286706.51 | 5.57e+06 | 9.00e+01 | 11486.00 | 26826.00 | 70615.00 | 3.12e+08 | ▇▁▁▁▁ |
| Net_International_Migration_2010_2019 | 85 | 0.97 | 7218.64 | 1.41e+05 | -1.45e+03 | 12.00 | 81.00 | 514.75 | 7.69e+06 | ▇▁▁▁▁ |
| NaturalChange1019 | 85 | 0.97 | 10550.17 | 2.07e+05 | -3.04e+04 | -186.75 | 83.50 | 1322.75 | 1.12e+07 | ▇▁▁▁▁ |
| TotalPopEst2015 | 6 | 1.00 | 294963.74 | 5.73e+06 | 8.80e+01 | 11375.00 | 26827.00 | 71021.00 | 3.21e+08 | ▇▁▁▁▁ |
| TotalPopEst2016 | 6 | 1.00 | 297056.91 | 5.77e+06 | 8.80e+01 | 11408.00 | 26626.00 | 71381.00 | 3.23e+08 | ▇▁▁▁▁ |
| TotalPopEst2017 | 6 | 1.00 | 298905.41 | 5.81e+06 | 8.60e+01 | 11342.00 | 26676.00 | 71352.00 | 3.25e+08 | ▇▁▁▁▁ |
| NetMigration1019 | 85 | 0.97 | 7218.64 | 1.51e+05 | -6.78e+05 | -845.75 | -146.00 | 983.75 | 7.69e+06 | ▇▁▁▁▁ |
| TotalPopEst2018 | 6 | 1.00 | 300427.07 | 5.84e+06 | 8.60e+01 | 11293.00 | 26702.00 | 71117.00 | 3.27e+08 | ▇▁▁▁▁ |
| TotalPopEstBase2010 | 7 | 1.00 | 284229.97 | 5.52e+06 | 8.20e+01 | 11564.00 | 26908.50 | 70334.75 | 3.09e+08 | ▇▁▁▁▁ |
| UnempRate2019 | 7 | 1.00 | 4.14 | 1.79e+00 | 7.00e-01 | 3.00 | 3.70 | 4.70 | 1.93e+01 | ▇▃▁▁▁ |
| UnempRate2018 | 7 | 1.00 | 4.29 | 1.88e+00 | 1.30e+00 | 3.10 | 3.90 | 4.90 | 1.96e+01 | ▇▂▁▁▁ |
| UnempRate2017 | 7 | 1.00 | 4.80 | 2.18e+00 | 1.60e+00 | 3.50 | 4.40 | 5.40 | 2.06e+01 | ▇▂▁▁▁ |
| UnempRate2016 | 7 | 1.00 | 5.44 | 2.36e+00 | 1.70e+00 | 4.00 | 5.00 | 6.20 | 2.41e+01 | ▇▂▁▁▁ |
| UnempRate2015 | 7 | 1.00 | 5.74 | 2.47e+00 | 1.80e+00 | 4.20 | 5.30 | 6.60 | 2.45e+01 | ▇▃▁▁▁ |
| UnempRate2014 | 7 | 1.00 | 6.50 | 2.87e+00 | 1.20e+00 | 4.70 | 6.10 | 7.60 | 2.64e+01 | ▇▆▁▁▁ |
| UnempRate2010 | 7 | 1.00 | 9.60 | 3.50e+00 | 2.10e+00 | 7.30 | 9.30 | 11.50 | 2.88e+01 | ▃▇▂▁▁ |
| UnempRate2007 | 12 | 1.00 | 5.06 | 2.12e+00 | 1.50e+00 | 3.70 | 4.70 | 5.80 | 2.04e+01 | ▇▃▁▁▁ |
| PctEmpChange1019 | 7 | 1.00 | 4.90 | 1.75e+01 | -3.41e+01 | -2.83 | 3.88 | 10.89 | 7.03e+02 | ▇▁▁▁▁ |
| PctEmpChange1819 | 7 | 1.00 | 0.78 | 2.95e+00 | -2.88e+01 | -0.30 | 0.77 | 1.82 | 1.04e+02 | ▁▇▁▁▁ |
| PctEmpChange0719 | 12 | 1.00 | -0.28 | 2.87e+01 | -5.92e+01 | -10.29 | -1.67 | 7.29 | 1.32e+03 | ▇▁▁▁▁ |
| PctEmpChange0710 | 12 | 1.00 | -5.20 | 1.03e+01 | -5.50e+01 | -10.62 | -5.34 | -0.49 | 9.16e+01 | ▁▇▁▁▁ |
| PctEmpAgriculture | 7 | 1.00 | 4.97 | 6.23e+00 | 0.00e+00 | 1.10 | 2.78 | 6.08 | 5.96e+01 | ▇▁▁▁▁ |
| PctEmpMining | 7 | 1.00 | 1.55 | 3.42e+00 | 0.00e+00 | 0.07 | 0.29 | 1.24 | 4.40e+01 | ▇▁▁▁▁ |
| PctEmpConstruction | 7 | 1.00 | 7.28 | 2.36e+00 | 0.00e+00 | 5.74 | 6.98 | 8.48 | 2.55e+01 | ▂▇▁▁▁ |
| PctEmpManufacturing | 7 | 1.00 | 12.21 | 7.09e+00 | 0.00e+00 | 6.79 | 11.34 | 16.53 | 5.17e+01 | ▇▇▂▁▁ |
| PctEmpTrade | 7 | 1.00 | 13.71 | 2.74e+00 | 8.40e-01 | 12.27 | 13.86 | 15.30 | 3.89e+01 | ▁▇▂▁▁ |
| PctEmpTrans | 7 | 1.00 | 5.50 | 2.07e+00 | 0.00e+00 | 4.13 | 5.20 | 6.49 | 2.77e+01 | ▇▆▁▁▁ |
| PctEmpInformation | 7 | 1.00 | 1.38 | 8.00e-01 | 0.00e+00 | 0.86 | 1.30 | 1.74 | 1.23e+01 | ▇▁▁▁▁ |
| PctEmpFIRE | 7 | 1.00 | 4.57 | 1.92e+00 | 0.00e+00 | 3.33 | 4.29 | 5.49 | 2.06e+01 | ▇▇▁▁▁ |
| PctEmpServices | 7 | 1.00 | 43.15 | 7.03e+00 | 8.33e+00 | 38.49 | 43.02 | 47.74 | 8.16e+01 | ▁▂▇▁▁ |
| PctEmpGovt | 7 | 1.00 | 5.69 | 3.28e+00 | 0.00e+00 | 3.56 | 4.81 | 6.75 | 3.86e+01 | ▇▂▁▁▁ |
| NumCivEmployed | 7 | 1.00 | 140355.71 | 2.73e+06 | 3.60e+01 | 4644.00 | 10897.00 | 31172.00 | 1.53e+08 | ▇▁▁▁▁ |
| UnempRate2011 | 7 | 1.00 | 8.96 | 3.38e+00 | 1.40e+00 | 6.70 | 8.70 | 10.70 | 2.89e+01 | ▃▇▁▁▁ |
| NumCivLaborForce2011 | 7 | 1.00 | 142130.42 | 2.76e+06 | 6.60e+01 | 5206.50 | 12015.50 | 33389.75 | 1.55e+08 | ▇▁▁▁▁ |
| NumEmployed2011 | 7 | 1.00 | 129380.25 | 2.51e+06 | 6.20e+01 | 4719.50 | 10836.50 | 30331.50 | 1.41e+08 | ▇▁▁▁▁ |
| NumCivLaborForce2012 | 7 | 1.00 | 142592.65 | 2.77e+06 | 6.70e+01 | 5176.50 | 11960.00 | 33001.75 | 1.55e+08 | ▇▁▁▁▁ |
| NumUnemployed2010 | 7 | 1.00 | 13691.22 | 2.67e+05 | 4.00e+00 | 490.50 | 1258.00 | 3295.00 | 1.49e+07 | ▇▁▁▁▁ |
| NumCivLaborForce2008 | 12 | 1.00 | 141609.77 | 2.75e+06 | 4.30e+01 | 5352.00 | 12504.00 | 33373.00 | 1.54e+08 | ▇▁▁▁▁ |
| NumUnemployed2011 | 7 | 1.00 | 12750.17 | 2.48e+05 | 4.00e+00 | 459.00 | 1174.50 | 3017.00 | 1.38e+07 | ▇▁▁▁▁ |
| NumEmployed2010 | 7 | 1.00 | 128146.41 | 2.49e+06 | 6.70e+01 | 4745.50 | 10868.50 | 30205.25 | 1.39e+08 | ▇▁▁▁▁ |
| NumCivLaborForce2010 | 7 | 1.00 | 141837.63 | 2.76e+06 | 7.10e+01 | 5249.00 | 12098.00 | 33485.00 | 1.54e+08 | ▇▁▁▁▁ |
| NumUnemployed2009 | 12 | 1.00 | 13129.36 | 2.55e+05 | 4.00e+00 | 490.50 | 1228.00 | 3226.50 | 1.42e+07 | ▇▁▁▁▁ |
| NumEmployed2009 | 12 | 1.00 | 128525.70 | 2.50e+06 | 3.90e+01 | 4882.50 | 11311.00 | 30499.50 | 1.40e+08 | ▇▁▁▁▁ |
| NumCivLaborForce2009 | 12 | 1.00 | 141655.06 | 2.75e+06 | 4.30e+01 | 5357.00 | 12507.00 | 33371.00 | 1.54e+08 | ▇▁▁▁▁ |
| UnempRate2008 | 12 | 1.00 | 6.00 | 2.37e+00 | 1.30e+00 | 4.40 | 5.70 | 7.10 | 2.26e+01 | ▇▇▁▁▁ |
| UnempRate2012 | 7 | 1.00 | 8.08 | 3.15e+00 | 1.10e+00 | 6.00 | 7.80 | 9.70 | 2.74e+01 | ▃▇▁▁▁ |
| NumEmployed2008 | 12 | 1.00 | 133387.51 | 2.59e+06 | 4.00e+01 | 5004.50 | 11701.00 | 31461.50 | 1.45e+08 | ▇▁▁▁▁ |
| UnempRate2009 | 12 | 1.00 | 9.25 | 3.49e+00 | 2.10e+00 | 6.85 | 8.80 | 11.10 | 2.74e+01 | ▅▇▂▁▁ |
| NumUnemployed2008 | 12 | 1.00 | 8222.26 | 1.60e+05 | 3.00e+00 | 312.00 | 788.00 | 2059.00 | 8.90e+06 | ▇▁▁▁▁ |
| NumUnemployed2015 | 7 | 1.00 | 7636.74 | 1.48e+05 | 4.00e+00 | 287.75 | 708.00 | 1870.25 | 8.28e+06 | ▇▁▁▁▁ |
| NumUnemployed2019 | 7 | 1.00 | 5514.96 | 1.07e+05 | 4.00e+00 | 203.00 | 499.00 | 1328.50 | 5.98e+06 | ▇▁▁▁▁ |
| NumCivLaborforce2019 | 7 | 1.00 | 149873.54 | 2.92e+06 | 2.23e+02 | 4989.75 | 11809.00 | 33505.00 | 1.63e+08 | ▇▁▁▁▁ |
| NumUnemployed2018 | 7 | 1.00 | 5794.81 | 1.13e+05 | 4.00e+00 | 211.00 | 512.50 | 1376.50 | 6.29e+06 | ▇▁▁▁▁ |
| NumEmployed2018 | 7 | 1.00 | 142510.43 | 2.77e+06 | 2.05e+02 | 4773.00 | 11216.00 | 31821.75 | 1.55e+08 | ▇▁▁▁▁ |
| NumCivLaborforce2018 | 7 | 1.00 | 148305.25 | 2.88e+06 | 2.11e+02 | 5021.25 | 11777.00 | 33305.75 | 1.61e+08 | ▇▁▁▁▁ |
| NumUnemployed2017 | 7 | 1.00 | 6432.11 | 1.25e+05 | 5.00e+00 | 232.75 | 569.50 | 1566.25 | 6.98e+06 | ▇▁▁▁▁ |
| NumEmployed2017 | 7 | 1.00 | 140750.69 | 2.74e+06 | 9.50e+01 | 4736.75 | 11135.00 | 31643.00 | 1.53e+08 | ▇▁▁▁▁ |
| NumCivLaborforce2017 | 7 | 1.00 | 147182.81 | 2.86e+06 | 1.00e+02 | 5026.75 | 11735.00 | 33075.75 | 1.60e+08 | ▇▁▁▁▁ |
| NumUnemployed2016 | 7 | 1.00 | 7122.21 | 1.38e+05 | 4.00e+00 | 270.00 | 659.50 | 1772.25 | 7.72e+06 | ▇▁▁▁▁ |
| NumEmployed2016 | 7 | 1.00 | 138661.77 | 2.70e+06 | 8.20e+01 | 4722.25 | 11035.00 | 31228.75 | 1.51e+08 | ▇▁▁▁▁ |
| NumCivLaborforce2016 | 7 | 1.00 | 145783.98 | 2.84e+06 | 8.60e+01 | 5046.75 | 11670.00 | 33081.00 | 1.59e+08 | ▇▁▁▁▁ |
| NumCivLaborforce2013 | 7 | 1.00 | 142920.08 | 2.78e+06 | 7.50e+01 | 5122.75 | 11867.50 | 32764.75 | 1.55e+08 | ▇▁▁▁▁ |
| NumEmployed2015 | 7 | 1.00 | 136473.11 | 2.65e+06 | 7.30e+01 | 4715.00 | 10904.00 | 30803.00 | 1.49e+08 | ▇▁▁▁▁ |
| NumEmployed2012 | 7 | 1.00 | 131061.00 | 2.55e+06 | 6.30e+01 | 4734.00 | 10944.00 | 30453.00 | 1.43e+08 | ▇▁▁▁▁ |
| NumUnemployed2014 | 7 | 1.00 | 8868.10 | 1.72e+05 | 4.00e+00 | 323.00 | 815.50 | 2131.75 | 9.62e+06 | ▇▁▁▁▁ |
| NumEmployed2014 | 7 | 1.00 | 134473.83 | 2.62e+06 | 7.50e+01 | 4691.25 | 10841.50 | 30703.50 | 1.46e+08 | ▇▁▁▁▁ |
| NumCivLaborforce2014 | 7 | 1.00 | 143341.93 | 2.79e+06 | 7.90e+01 | 5046.75 | 11694.50 | 32630.25 | 1.56e+08 | ▇▁▁▁▁ |
| UnempRate2013 | 7 | 1.00 | 7.62 | 3.14e+00 | 1.20e+00 | 5.50 | 7.30 | 9.10 | 2.74e+01 | ▆▇▁▁▁ |
| NumUnemployed2013 | 7 | 1.00 | 10565.91 | 2.06e+05 | 4.00e+00 | 376.75 | 963.50 | 2495.25 | 1.15e+07 | ▇▁▁▁▁ |
| NumEmployed2019 | 7 | 1.00 | 144358.58 | 2.81e+06 | 2.12e+02 | 4778.75 | 11277.00 | 32125.50 | 1.57e+08 | ▇▁▁▁▁ |
| NumEmployed2013 | 7 | 1.00 | 132354.16 | 2.57e+06 | 7.10e+01 | 4697.25 | 10807.00 | 30454.50 | 1.44e+08 | ▇▁▁▁▁ |
| NumUnemployed2007 | 12 | 1.00 | 6507.96 | 1.26e+05 | 3.00e+00 | 261.50 | 650.00 | 1724.00 | 7.04e+06 | ▇▁▁▁▁ |
| NumEmployed2007 | 12 | 1.00 | 133672.82 | 2.59e+06 | 3.80e+01 | 5036.50 | 11744.00 | 31617.00 | 1.45e+08 | ▇▁▁▁▁ |
| NumCivLaborforce2007 | 12 | 1.00 | 140180.77 | 2.72e+06 | 4.10e+01 | 5307.00 | 12420.00 | 33129.00 | 1.52e+08 | ▇▁▁▁▁ |
| NumUnemployed2012 | 7 | 1.00 | 11531.64 | 2.25e+05 | 4.00e+00 | 405.00 | 1043.50 | 2685.75 | 1.25e+07 | ▇▁▁▁▁ |
| NumCivLaborforce2015 | 7 | 1.00 | 144109.86 | 2.80e+06 | 7.70e+01 | 5017.75 | 11598.00 | 32745.75 | 1.57e+08 | ▇▁▁▁▁ |
| RuralUrbanContinuumCode2013 | 57 | 0.98 | 4.94 | 2.72e+00 | 1.00e+00 | 2.00 | 6.00 | 7.00 | 9.00e+00 | ▇▅▁▇▅ |
| UrbanInfluenceCode2013 | 57 | 0.98 | 5.19 | 3.51e+00 | 1.00e+00 | 2.00 | 5.00 | 8.00 | 1.20e+01 | ▇▂▃▂▃ |
| RuralUrbanContinuumCode2003 | 54 | 0.98 | 5.06 | 2.70e+00 | 1.00e+00 | 3.00 | 6.00 | 7.00 | 9.00e+00 | ▆▅▁▇▅ |
| UrbanInfluenceCode2003 | 54 | 0.98 | 5.38 | 3.49e+00 | 1.00e+00 | 2.00 | 5.00 | 8.00 | 1.20e+01 | ▇▃▃▃▃ |
| Metro2013 | 57 | 0.98 | 0.38 | 4.90e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▅ |
| Nonmetro2013 | 57 | 0.98 | 0.62 | 4.90e-01 | 0.00e+00 | 0.00 | 1.00 | 1.00 | 1.00e+00 | ▅▁▁▁▇ |
| Micropolitan2013 | 57 | 0.98 | 0.20 | 4.00e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▂ |
| Type_2015_Update | 136 | 0.96 | 1.81 | 1.82e+00 | 0.00e+00 | 0.00 | 1.00 | 3.00 | 5.00e+00 | ▇▁▂▂▂ |
| Type_2015_Farming_NO | 136 | 0.96 | 0.14 | 3.50e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| Type_2015_Manufacturing_NO | 136 | 0.96 | 0.16 | 3.70e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▂ |
| Type_2015_Mining_NO | 136 | 0.96 | 0.07 | 2.60e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| Type_2015_Government_NO | 136 | 0.96 | 0.13 | 3.40e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| Type_2015_Recreation_NO | 136 | 0.96 | 0.11 | 3.10e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| Low_Education_2015_update | 136 | 0.96 | 0.15 | 3.60e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▂ |
| Low_Employment_2015_update | 136 | 0.96 | 0.29 | 4.50e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▃ |
| Population_loss_2015_update | 136 | 0.96 | 0.17 | 3.70e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▂ |
| Retirement_Destination_2015_Update | 136 | 0.96 | 0.14 | 3.50e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| Perpov_1980_0711 | 136 | 0.96 | 0.11 | 3.20e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| PersistentChildPoverty_1980_2011 | 136 | 0.96 | 0.23 | 4.20e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▂ |
| Hipov | 58 | 0.98 | 0.23 | 4.50e-01 | -9.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▁▁▁▁▇ |
| HiAmenity | 172 | 0.95 | 0.25 | 4.30e-01 | 0.00e+00 | 0.00 | 0.00 | 0.50 | 1.00e+00 | ▇▁▁▁▃ |
| HiCreativeClass2000 | 140 | 0.96 | 0.25 | 4.30e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▃ |
| Gas_Change | 170 | 0.95 | 0.56 | 1.99e+00 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 9.00e+00 | ▇▁▁▁▁ |
| Oil_Change | 170 | 0.95 | 0.42 | 1.77e+00 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 9.00e+00 | ▇▁▁▁▁ |
| Oil_Gas_Change | 170 | 0.95 | 0.75 | 2.29e+00 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 9.00e+00 | ▇▁▁▁▁ |
| Metro2003 | 54 | 0.98 | 0.36 | 4.80e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▅ |
| NonmetroNotAdj2003 | 54 | 0.98 | 0.31 | 4.60e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▃ |
| NonmetroAdj2003 | 54 | 0.98 | 0.33 | 4.70e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▃ |
| Noncore2003 | 54 | 0.98 | 0.43 | 5.00e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▆ |
| EconomicDependence2000 | 138 | 0.96 | 3.92 | 1.74e+00 | 1.00e+00 | 3.00 | 4.00 | 6.00 | 6.00e+00 | ▅▇▃▃▇ |
| Nonmetro2003 | 54 | 0.98 | 0.64 | 4.80e-01 | 0.00e+00 | 0.00 | 1.00 | 1.00 | 1.00e+00 | ▅▁▁▁▇ |
| Micropolitan2003 | 54 | 0.98 | 0.21 | 4.10e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▂ |
| FarmDependent2003 | 138 | 0.96 | 0.14 | 3.50e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| ManufacturingDependent2000 | 138 | 0.96 | 0.29 | 4.50e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▃ |
| LowEducation2000 | 138 | 0.96 | 0.20 | 4.00e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▂ |
| RetirementDestination2000 | 138 | 0.96 | 0.14 | 3.50e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| PersistentPoverty2000 | 138 | 0.96 | 0.12 | 3.30e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| Noncore2013 | 57 | 0.98 | 0.42 | 4.90e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▆ |
| Type_2015_Nonspecialized_NO | 136 | 0.96 | 0.39 | 4.90e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▅ |
| Metro_Adjacent2013 | 57 | 0.98 | 0.32 | 4.70e-01 | 0.00e+00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▃ |
| PersistentChildPoverty2004 | 138 | 0.96 | 0.23 | 4.20e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▂ |
| RecreationDependent2000 | 138 | 0.96 | 0.11 | 3.10e-01 | 0.00e+00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
# explore covid_rate
# head(covid_rate)
# dim(covid_rate)
# str(covid_rate)
# summary(covid_rate)
skim(covid_rate)| Name | covid_rate |
| Number of rows | 1008984 |
| Number of columns | 8 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| Date | 1 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| County | 0 | 1 | 2 | 21 | 0 | 1819 | 0 |
| State | 0 | 1 | 4 | 20 | 0 | 49 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2020-01-21 | 2021-02-20 | 2020-09-11 | 397 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FIPS | 0 | 1 | 30571.7 | 15054.2 | 1001 | 19013 | 29179 | 45085 | 56045 | ▃▇▆▅▇ |
| cum_cases | 0 | 1 | 2932.8 | 14747.2 | 0 | 54 | 366 | 1557 | 1179633 | ▇▁▁▁▁ |
| cum_deaths | 0 | 1 | 65.6 | 322.9 | 0 | 1 | 7 | 31 | 19793 | ▇▁▁▁▁ |
| week | 0 | 1 | 34.1 | 13.6 | 1 | 23 | 34 | 46 | 57 | ▂▇▇▇▇ |
| TotalPopEst2019 | 0 | 1 | 112404.7 | 356632.8 | 169 | 12172 | 27772 | 74228 | 10039107 | ▇▁▁▁▁ |
Based on this summary, we note that:
county_data contain NAs.covid_rate dataset does not contain any NAs.# change county and state to factors in both datasets
county_data <- county_data %>%
mutate(State = as.factor(State),
County = as.factor(County))
covid_rate <- covid_rate %>%
mutate(State = as.factor(State),
County = as.factor(County))It is crucial to decide the right granularity for visualization and analysis. We will compare daily vs weekly total new cases by state and we will see it is hard to interpret daily report.
First, we start by creating a county_new_cases variable, defined as the row-wise difference of the existing cumulative variable cum_cases. Next, we create a state_new_cases variable, which is the sum of all new cases per day, across all counties of a given state.
# create county_new_cases variable
covid_rate <- covid_rate %>%
group_by(County, State) %>%
mutate(county_new_cases = cum_cases - lag(cum_cases)) %>%
mutate(county_new_cases = ifelse(is.na(county_new_cases), cum_cases, county_new_cases)) # the first new_cases entry for every county is an NA as there was no prior data point to compute a difference with. Thus, we replace NAs with the first reported number of cases that day.
# let's see if this new variable seems right
summary(covid_rate$county_new_cases)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2321 0 3 27 14 29174
# some values are negative! Let's see which:
covid_rate[which(covid_rate$county_new_cases < -1000), ]## # A tibble: 6 × 9
## # Groups: County, State [6]
## FIPS date County State cum_cases cum_deaths week TotalPopEst2019
## <int> <date> <fct> <fct> <int> <int> <int> <int>
## 1 12095 2020-12-08 Orange Florida 60290 641 47 1393452
## 2 25017 2020-09-02 Middlesex Massac… 25411 2070 33 1611699
## 3 25021 2020-09-02 Norfolk Massac… 9569 1019 33 706775
## 4 25025 2020-09-02 Suffolk Massac… 22508 1104 33 803907
## 5 48001 2020-09-18 Anderson Texas 1744 28 35 57735
## 6 6067 2021-01-17 Sacramento Califo… 79121 1080 53 1552058
## # … with 1 more variable: county_new_cases <int>
covid_rate[covid_rate$County == "Middlesex" & covid_rate$State == "Massachusetts" & (covid_rate$date == "2020-09-01" | covid_rate$date == "2020-09-02"), ]## # A tibble: 2 × 9
## # Groups: County, State [1]
## FIPS date County State cum_cases cum_deaths week TotalPopEst2019
## <int> <date> <fct> <fct> <int> <int> <int> <int>
## 1 25017 2020-09-01 Middlesex Massach… 27732 2071 33 1611699
## 2 25017 2020-09-02 Middlesex Massach… 25411 2070 33 1611699
## # … with 1 more variable: county_new_cases <int>
# it looks like some counties have large "jumps" in their cumulative cases, for instance in MS, Middlesex county reports 27732 on 09-01, but then 25411 the next day, which leads to a difference of -2321! We suspect this is a reporting error, or a correction for a number of suspected cases which turned out not be COVID?
# To correct for this in the county_new_cases, we replace all negative values by 0, indicating no new cases that day.
covid_rate$county_new_cases <- ifelse((covid_rate$county_new_cases) < 0,
0, # if value is smaller than 0, attribute 0 to this cell
covid_rate$county_new_cases) # else, just report the current value
summary(covid_rate$county_new_cases) # values seem reasonable now.## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 3 27 14 29174
covid_rate[covid_rate$County == "Middlesex" & covid_rate$State == "Massachusetts", ]## # A tibble: 353 × 9
## # Groups: County, State [1]
## FIPS date County State cum_cases cum_deaths week TotalPopEst2019
## <int> <date> <fct> <fct> <int> <int> <int> <int>
## 1 25017 2020-03-05 Middlesex Massac… 1 0 7 1611699
## 2 25017 2020-03-06 Middlesex Massac… 1 0 7 1611699
## 3 25017 2020-03-07 Middlesex Massac… 5 0 7 1611699
## 4 25017 2020-03-08 Middlesex Massac… 10 0 8 1611699
## 5 25017 2020-03-09 Middlesex Massac… 15 0 8 1611699
## 6 25017 2020-03-10 Middlesex Massac… 41 0 8 1611699
## 7 25017 2020-03-11 Middlesex Massac… 41 0 8 1611699
## 8 25017 2020-03-12 Middlesex Massac… 49 0 8 1611699
## 9 25017 2020-03-13 Middlesex Massac… 60 0 8 1611699
## 10 25017 2020-03-14 Middlesex Massac… 65 0 8 1611699
## # … with 343 more rows, and 1 more variable: county_new_cases <dbl>
# this provides the number of new cases per county though, and we need to know the total sum of new daily cases per state.
# So let's create a state_new_cases variable
daily_state <- covid_rate %>%
group_by(State, date) %>%
summarise(
state_new_cases = sum(county_new_cases, na.rm = TRUE)
)## `summarise()` has grouped output by 'State'. You can override using the
## `.groups` argument.
# mutate(state_new_cases = sum(county_new_cases, na.rm = TRUE))
# let's see if this new variable seems right
summary(daily_state$state_new_cases) # seems reasonable, no negative values## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 131 575 1588 1593 64987
tmp <- covid_rate[covid_rate$date == "2020-03-13" & covid_rate$State == "Alabama", ]
sum(tmp$county_new_cases)## [1] 6
# the variable seems correct.# plot new daily covid cases in NY, WA and FL
daily_state %>%
filter(State %in% c("New York", "Washington", "Florida")) %>%
ggplot(aes(x = date, y = state_new_cases, color = State)) +
geom_point() +
facet_grid(~State) +
theme_bw() +
ggtitle("New daily COVID cases in Florida, New York and Washington")# plot the cum_cases for a few counties to illustrate "breaks" in the cumulative variable (this should technically not be possible)
PA <- covid_rate %>%
filter(State %in% c("Pennsylvania"),
County %in% c("Lancaster", "Allegheny", "Philadelphia")) %>%
ggplot(aes(x = date, y = cum_cases, color = County)) +
geom_point() +
facet_grid(~County) +
theme_bw() +
ggtitle("New daily COVID cases in 3 counties of Pennsylvania")
MS <- covid_rate %>%
filter(State %in% c("Massachusetts"),
County %in% c("Suffolk", "Barnstable", "Middlesex")) %>%
ggplot(aes(x = date, y = cum_cases, color = County)) +
geom_point() +
facet_grid(~County) +
theme_bw() +
ggtitle("New daily COVID cases in 3 counties of Massachusetts")
ggpubr::ggarrange(PA, MS, nrow = 2)When using single day data, there is more variability in the number of new cases, making the trends harder to see. For instance, some days have close to 0 new cases, while other days contains up to 20000 new cases (Florida in January 2021). Thus, looking at the evolution of cases on a daily level may be too granular to get a good sense of COVID trends over time.
weekly_case_per100k. Plot the spaghetti plots of weekly_case_per100k by state. Use TotalPopEst2019 as population.We want to know the number of new cases per week and per 100k in each state. To do this, we sum the new cases up per state and week, divide it by the total population and then by 100’000.
# create total state population variable
total_pop <- covid_rate %>%
group_by(County, State) %>%
filter(row_number() == 1) %>%
group_by(State) %>%
mutate(state_pop = sum(TotalPopEst2019)) %>%
select(FIPS, state_pop)## Adding missing grouping variables: `State`
# merge to main dataframe
covid_rate <- merge(covid_rate, total_pop, all.x= T)
# create weekly new cases per state and per 100k
weekly_state <- covid_rate %>%
group_by(State, week) %>%
summarise(
state_weekly_new = sum(county_new_cases, na.rm = TRUE),
weekly_case_per100k = state_weekly_new / (first(state_pop) / 100000)
)## `summarise()` has grouped output by 'State'. You can override using the
## `.groups` argument.
# spaghetti plot of weekly_case_per100k for all States
weekly_state %>%
ggplot(aes(x = week, y = weekly_case_per100k, color = State)) +
geom_line() +
theme_bw() +
ggtitle("Weekly new cases per 100k in all States") +
theme(legend.position="bottom")# spaghetti plot of weekly_case_per100k for NY, FL, WA
weekly_state %>%
filter(State %in% c("New York", "Washington", "Florida")) %>%
ggplot(aes(x = week, y = weekly_case_per100k, color = State)) +
geom_line(size = 2) +
theme_bw() +
ggtitle("Weekly new cases per 100k in Florida, New York and Washington") +
theme(legend.position="bottom")In general, across all States we note 3-4 major waves across this time period: * Around week 10, corresponding to end of March, 2020 * Around week 25, corresponding to early July, 2020 * Around week 42 (major peak), corresponding to early November 2020 * Around week 46, corresponding to early December 2020
When plotting Washington, Florida and New York only, we note that while all 3 States had a wave of COVID cases in early December, New York had its own wave around week 10-11 (end of March, 2020) and Florida had its own wave around week 25-26 (July 2020).
These observations are likely due to a combination of changes in restrictions as well as major holidays and tourism/population flow. For instance, New York may have been initially more heavily impacted at the start of the pandemic in March 2020 due to the high number of travellers passing through NYC. Florida was more affected in summer, as its beaches attract more tourists. Finally, all States were impacted in early December as a consequence of gatherings during Thanksgiving.
covid_intervention to see whether the effectiveness of lockdown in flattening the curve.limits argument in scale_fill_gradient() or use facet_wrap(); use lubridate::month() and lubridate::year() to extract month and year from date; use tidyr::complete(state, month, fill = list(new_case_per100k = NA)) to complete the missing months with no cases.)# I'll first create a month and year variable for each row. This will help us pick out what we need.
covid_rate <- covid_rate %>%
mutate(YEAR = lubridate::year(date)) %>%
mutate(MONTH = lubridate::month(date))
covid_rate$MONTH <- month.abb[covid_rate$MONTH] # convert to month names
# Here I extract just the data from 2020
covid_rate_2020 <- covid_rate[covid_rate$YEAR == 2020, ]
# Create a daily deaths per
covid_rate_2020 <- covid_rate_2020 %>%
group_by(County, State) %>%
mutate(county_new_deaths = cum_deaths - lag(cum_deaths)) %>%
mutate(county_new_deaths = ifelse(is.na(county_new_deaths), cum_deaths, county_new_deaths)) # the first county_new_deaths entry for every county is an NA as there was no prior data point to compute a difference with. Thus, we replace NAs with the first reported number of deaths that day.
summary(covid_rate_2020$county_new_deaths) # again, we see that some are negative - this is due to later cum_deaths being less than preceding ones. This is likely due to correcting misreporting or errors in data entry. ## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -512 0 0 0 0 455
# To account for this, set negative values to zero.
covid_rate_2020$county_new_deaths <- ifelse((covid_rate_2020$county_new_deaths) < 0,
0, # if value is smaller than 0, attribute 0 to this cell
covid_rate_2020$county_new_deaths) # else, just report the current value
#summary(covid_rate_2020$county_new_deaths) # looks like we're in good shape!
# let's now get deaths per 100k population
covid_rate_2020 <- covid_rate_2020 %>%
mutate(deaths_per_100k = county_new_deaths/(state_pop/100000))
# now, let's create a new data table with just new deaths per month by state (rows: state, columns: months)
covid_deaths_2020 <- covid_rate_2020 %>%
group_by(State, MONTH) %>%
summarize(state_deaths_per_100k = sum(deaths_per_100k))## `summarise()` has grouped output by 'State'. You can override using the
## `.groups` argument.
covid_deaths_2020$MONTH <- factor(covid_deaths_2020$MONTH, levels = c(month.abb[1:12])) # make months factors and reorder. This will help later when using facet_wrap so that months are displayed in order
max_deaths <- max(covid_deaths_2020$state_deaths_per_100k) # get the upper limit for our color gradient
# filling missing months with NA
covid_deaths_2020 <- covid_deaths_2020 %>%
group_by(State) %>%
complete(MONTH) %>%
ungroup()
# now we can generate our heatmaps
colnames(covid_deaths_2020)[1] = 'state' # lowercase to match what plot_usmap expects
# create a plot for month
covid_map <- plot_usmap(data = covid_deaths_2020, values = "state_deaths_per_100k", regions = 'states', color = 'black', include = unique(covid_deaths_2020$state)) +
scale_fill_continuous(low = "white", high = "red4", name = "COVID-19 Deaths per 100k", limits = c(0, max_deaths)) +
labs(title = "COVID-19 Deaths per 100k Month-by-Month") +
theme(legend.position = "right") + facet_wrap(vars(MONTH))
covid_mapplotly to animate the monthly maps in i). Does it reveal any systematic way to capture the dynamic changes among states? (Hints: Follow Appendix 3: Plotly heatmap:: in Module 6 regularization lecture to plot the heatmap using plotly. Use frame argument in add_trace() for animation. plotly only recognizes abbreviation of state names. Use unique(us_map(regions = "states") %>% select(abbr, full)) to get the abbreviation and merge with the data to get state abbreviation.)# get US state abbreviations
state_abbr <- unique(us_map(regions = "states") %>% select(abbr, full))
# rename full to state
state_abbr <- state_abbr %>%
rename(state = full)
# merge with main dataframe
covid_deaths_2020 <- merge(covid_deaths_2020,
state_abbr,
all.x = T)
# add a variable hover to store the state description as text when you hover on a state
# <br> means new line in HTML
plotting_info <- covid_deaths_2020 %>%
mutate(hover = paste(state, '<br>',
'Deaths per 100k: ', round(state_deaths_per_100k, 1)))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = FALSE
)
fig <- plot_geo(plotting_info, locationmode = 'USA-states')
fig <- fig %>% add_trace(
z = ~state_deaths_per_100k, text = ~hover, locations = ~abbr,
color = ~state_deaths_per_100k, colors = 'Reds',
zauto = FALSE, zmin = 0, zmax = max_deaths,
frame = ~MONTH
)
fig <- fig %>% layout(
title = 'COVID-19 Deaths per 100k',
geo = g,
hoverlabel = list(bgcolor = "black")
)
figIt does reveal how the COVID wave spreads primarily from states with high international travel and large urban centers (i.e NYC), through neighboring states, and settles in rural states last.
We now try to build a good parsimonious model to find possible factors related to death rate on county level. Let us not take time series into account for the moment and use the total number as of Feb 1, 2021.
total_death_per100k as the total of number of COVID deaths per 100k by Feb 1, 2021. We suggest to take log transformation as log_total_death_per100k = log(total_death_per100k + 1). Merge total_death_per100k to county_data for the following analysis.# let's create a response variable. We'll use the log transformed deaths in our analysis.
# get just data for Feb 1 2021
feb1_data <- covid_rate[covid_rate$date == as.Date("2021-02-01"), ]
# create a total_death_per100k variable. Here we'll be using county population to make our calculation, as we're working at the county level.
feb1_data <- feb1_data %>%
mutate(total_death_per100k = cum_deaths/(TotalPopEst2019/100000))
# now, we can derive a log transform of our variable
feb1_data <- feb1_data %>%
mutate(log_total_death_per100k = log(total_death_per100k + 1))
# Append to the county_data data.frame. We'll use the FIPS code to link the new variables to the correct county.
county_data <- merge(x=county_data, y=feb1_data[, c("FIPS", "total_death_per100k", "log_total_death_per100k")], by="FIPS", all.x=TRUE)
summary(county_data[, c("total_death_per100k", "log_total_death_per100k")]) # let's take a look at our new appended variables ot make sure everything is ok. Looks like we have some NAs, though this is expected because there are more counties in county_data than in covid_rate. ## total_death_per100k log_total_death_per100k
## Min. : 0 Min. :0.0
## 1st Qu.: 81 1st Qu.:4.4
## Median :137 Median :4.9
## Mean :153 Mean :4.7
## 3rd Qu.:201 3rd Qu.:5.3
## Max. :835 Max. :6.7
## NA's :171 NA's :171
#length(unique(covid_rate$FIPS)) < length(county_data$FIPS)Select possible variables in county_data as covariates. We provide county_data_sub, a subset variables from county_data, for you to get started. Please add any potential variables as you wish.
Report missing values in your final subset of variables.
In the following anaylsis, you may ignore the missing values.
We are adding AvgHHSize, indicating the average household size, as crowdedness within households may potentially have an influence on COVID spreading.
county_data_sub <- county_data %>%
select(County, State, FIPS, # general info
Deep_Pov_All, PovertyAllAgesPct, PerCapitaInc, # income/poverty
UnempRate2019, PctEmpFIRE, PctEmpConstruction, PctEmpTrans, PctEmpMining, PctEmpTrade, PctEmpInformation, PctEmpAgriculture, PctEmpManufacturing, PctEmpServices, # employment
PopDensity2010, OwnHomePct, AvgHHSize, # housing (adding AvgHHSize here as crowdedness may have an influence on COVID spreading?)
Age65AndOlderPct2010, TotalPop25Plus, Under18Pct2010, # age
Ed2HSDiplomaOnlyPct, Ed3SomeCollegePct, Ed4AssocDegreePct, Ed5CollegePlusPct, # education
ForeignBornPct, Net_International_Migration_Rate_2010_2019, NetMigrationRate1019, # migration
NaturalChangeRate1019, TotalPopEst2019, # population size
WhiteNonHispanicPct2010, NativeAmericanNonHispanicPct2010, BlackNonHispanicPct2010, AsianNonHispanicPct2010, HispanicPct2010, # race & ethnicity
Type_2015_Update, RuralUrbanContinuumCode2013, UrbanInfluenceCode2013, # county type (rural vs urban)
Perpov_1980_0711, # persistent poverty counties
HiCreativeClass2000, HiAmenity, Retirement_Destination_2015_Update, # other
log_total_death_per100k) # dependent variable
# get a sense of the data subset
skim(county_data_sub)| Name | county_data_sub |
| Number of rows | 3279 |
| Number of columns | 44 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 42 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| County | 0 | 1 | FALSE | 1947 | Was: 32, Fra: 26, Jef: 26, Jac: 24 |
| State | 0 | 1 | FALSE | 53 | TX: 255, GA: 160, VA: 135, KY: 121 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FIPS | 0 | 1.00 | 3.13e+04 | 1.63e+04 | 0.00 | 19020.00 | 30019.00 | 46101.50 | 7.22e+04 | ▅▇▇▇▁ |
| Deep_Pov_All | 6 | 1.00 | 7.15e+00 | 4.56e+00 | 0.00 | 4.50 | 6.16 | 8.22 | 4.40e+01 | ▇▂▁▁▁ |
| PovertyAllAgesPct | 86 | 0.97 | 1.51e+01 | 6.09e+00 | 2.60 | 10.80 | 14.10 | 18.30 | 5.40e+01 | ▆▇▂▁▁ |
| PerCapitaInc | 6 | 1.00 | 2.67e+04 | 6.95e+03 | 5974.00 | 22553.00 | 26169.00 | 30106.00 | 7.28e+04 | ▁▇▂▁▁ |
| UnempRate2019 | 7 | 1.00 | 4.14e+00 | 1.79e+00 | 0.70 | 3.00 | 3.70 | 4.70 | 1.93e+01 | ▇▃▁▁▁ |
| PctEmpFIRE | 7 | 1.00 | 4.57e+00 | 1.92e+00 | 0.00 | 3.33 | 4.29 | 5.49 | 2.06e+01 | ▇▇▁▁▁ |
| PctEmpConstruction | 7 | 1.00 | 7.28e+00 | 2.36e+00 | 0.00 | 5.74 | 6.98 | 8.48 | 2.55e+01 | ▂▇▁▁▁ |
| PctEmpTrans | 7 | 1.00 | 5.50e+00 | 2.07e+00 | 0.00 | 4.13 | 5.20 | 6.49 | 2.77e+01 | ▇▆▁▁▁ |
| PctEmpMining | 7 | 1.00 | 1.55e+00 | 3.42e+00 | 0.00 | 0.07 | 0.29 | 1.24 | 4.40e+01 | ▇▁▁▁▁ |
| PctEmpTrade | 7 | 1.00 | 1.37e+01 | 2.74e+00 | 0.84 | 12.27 | 13.86 | 15.30 | 3.89e+01 | ▁▇▂▁▁ |
| PctEmpInformation | 7 | 1.00 | 1.38e+00 | 8.00e-01 | 0.00 | 0.86 | 1.30 | 1.74 | 1.23e+01 | ▇▁▁▁▁ |
| PctEmpAgriculture | 7 | 1.00 | 4.97e+00 | 6.23e+00 | 0.00 | 1.10 | 2.78 | 6.08 | 5.96e+01 | ▇▁▁▁▁ |
| PctEmpManufacturing | 7 | 1.00 | 1.22e+01 | 7.09e+00 | 0.00 | 6.79 | 11.34 | 16.53 | 5.17e+01 | ▇▇▂▁▁ |
| PctEmpServices | 7 | 1.00 | 4.31e+01 | 7.03e+00 | 8.33 | 38.49 | 43.02 | 47.74 | 8.16e+01 | ▁▂▇▁▁ |
| PopDensity2010 | 6 | 1.00 | 2.85e+02 | 1.72e+03 | 0.04 | 17.79 | 47.05 | 131.29 | 6.95e+04 | ▇▁▁▁▁ |
| OwnHomePct | 6 | 1.00 | 7.13e+01 | 8.22e+00 | 4.26 | 67.36 | 72.42 | 76.91 | 9.24e+01 | ▁▁▁▇▅ |
| AvgHHSize | 6 | 1.00 | 2.53e+00 | 2.80e-01 | 1.34 | 2.36 | 2.49 | 2.65 | 4.97e+00 | ▁▇▁▁▁ |
| Age65AndOlderPct2010 | 6 | 1.00 | 1.58e+01 | 4.15e+00 | 3.47 | 13.06 | 15.41 | 18.08 | 4.34e+01 | ▂▇▂▁▁ |
| TotalPop25Plus | 6 | 1.00 | 2.01e+05 | 3.90e+06 | 69.00 | 7952.00 | 18362.00 | 47876.00 | 2.18e+08 | ▇▁▁▁▁ |
| Under18Pct2010 | 6 | 1.00 | 2.35e+01 | 3.34e+00 | 0.00 | 21.49 | 23.43 | 25.13 | 4.16e+01 | ▁▁▇▃▁ |
| Ed2HSDiplomaOnlyPct | 6 | 1.00 | 3.41e+01 | 7.17e+00 | 5.47 | 29.55 | 34.26 | 39.07 | 5.56e+01 | ▁▂▇▇▁ |
| Ed3SomeCollegePct | 6 | 1.00 | 2.16e+01 | 4.14e+00 | 4.12 | 18.98 | 21.58 | 24.02 | 3.87e+01 | ▁▂▇▂▁ |
| Ed4AssocDegreePct | 6 | 1.00 | 8.93e+00 | 2.64e+00 | 1.12 | 7.16 | 8.67 | 10.53 | 2.14e+01 | ▁▇▆▁▁ |
| Ed5CollegePlusPct | 6 | 1.00 | 2.17e+01 | 9.40e+00 | 0.00 | 15.19 | 19.47 | 26.01 | 7.85e+01 | ▃▇▂▁▁ |
| ForeignBornPct | 85 | 0.97 | 4.80e+00 | 5.74e+00 | 0.00 | 1.37 | 2.78 | 5.88 | 5.32e+01 | ▇▁▁▁▁ |
| Net_International_Migration_Rate_2010_2019 | 85 | 0.97 | 9.20e-01 | 1.60e+00 | -1.25 | 0.10 | 0.40 | 1.10 | 2.13e+01 | ▇▁▁▁▁ |
| NetMigrationRate1019 | 85 | 0.97 | 1.00e-02 | 7.58e+00 | -32.18 | -4.07 | -1.15 | 3.17 | 1.16e+02 | ▅▇▁▁▁ |
| NaturalChangeRate1019 | 85 | 0.97 | 1.03e+00 | 3.66e+00 | -11.03 | -1.36 | 0.59 | 3.01 | 2.31e+01 | ▁▇▃▁▁ |
| TotalPopEst2019 | 6 | 1.00 | 3.02e+05 | 5.87e+06 | 86.00 | 11317.00 | 26687.00 | 71686.00 | 3.28e+08 | ▇▁▁▁▁ |
| WhiteNonHispanicPct2010 | 6 | 1.00 | 7.63e+01 | 2.29e+01 | 0.20 | 65.34 | 84.99 | 94.00 | 9.92e+01 | ▁▁▂▃▇ |
| NativeAmericanNonHispanicPct2010 | 6 | 1.00 | 1.82e+00 | 7.47e+00 | 0.00 | 0.19 | 0.30 | 0.62 | 9.50e+01 | ▇▁▁▁▁ |
| BlackNonHispanicPct2010 | 6 | 1.00 | 8.57e+00 | 1.43e+01 | 0.00 | 0.38 | 1.79 | 9.53 | 8.54e+01 | ▇▁▁▁▁ |
| AsianNonHispanicPct2010 | 6 | 1.00 | 1.15e+00 | 2.54e+00 | 0.00 | 0.26 | 0.46 | 1.00 | 4.30e+01 | ▇▁▁▁▁ |
| HispanicPct2010 | 6 | 1.00 | 1.05e+01 | 1.90e+01 | 0.00 | 1.63 | 3.46 | 9.16 | 9.97e+01 | ▇▁▁▁▁ |
| Type_2015_Update | 136 | 0.96 | 1.81e+00 | 1.82e+00 | 0.00 | 0.00 | 1.00 | 3.00 | 5.00e+00 | ▇▁▂▂▂ |
| RuralUrbanContinuumCode2013 | 57 | 0.98 | 4.94e+00 | 2.72e+00 | 1.00 | 2.00 | 6.00 | 7.00 | 9.00e+00 | ▇▅▁▇▅ |
| UrbanInfluenceCode2013 | 57 | 0.98 | 5.19e+00 | 3.51e+00 | 1.00 | 2.00 | 5.00 | 8.00 | 1.20e+01 | ▇▂▃▂▃ |
| Perpov_1980_0711 | 136 | 0.96 | 1.10e-01 | 3.20e-01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| HiCreativeClass2000 | 140 | 0.96 | 2.50e-01 | 4.30e-01 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00e+00 | ▇▁▁▁▃ |
| HiAmenity | 172 | 0.95 | 2.50e-01 | 4.30e-01 | 0.00 | 0.00 | 0.00 | 0.50 | 1.00e+00 | ▇▁▁▁▃ |
| Retirement_Destination_2015_Update | 136 | 0.96 | 1.40e-01 | 3.50e-01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00e+00 | ▇▁▁▁▁ |
| log_total_death_per100k | 171 | 0.95 | 4.75e+00 | 9.80e-01 | 0.00 | 4.41 | 4.93 | 5.31 | 6.73e+00 | ▁▁▂▇▂ |
Most variables have a number of missing values (ranging from around 5-170 NAs). As instructed, we will ignore these in building our model.
lambda.1se to choose a smaller model.State should be forced into the model because this is a critical variable of interest in our study. To avoid potentially discarding it, we use the penalty.factor argument from cv.glmnet to not impose any penalty on the coefficients of this variable.
# remove county variable as we don't want to include this in the model
county_data_sub <- county_data_sub %>%
select(-County, -FIPS)
# county_data_sub$Type_2015_Update <- as.factor(county_data_sub$Type_2015_Update) # set this variable to be a factor
# remove all rows containing NAs (LASSO does not work without this step)
county_data_sub <- na.omit(county_data_sub)
# create matrices to feed into gmnet
Y <- as.matrix(county_data_sub[, 42]) # extract Y (log_total_death_per100k)
X <- model.matrix(log_total_death_per100k~., data = county_data_sub)[, -1] # remove the first (interecept) column with only 1s
# checking data types (should be matrices)
# typeof(X)
# typeof(Y)
# to control the randomness in K folds
set.seed(10)
# building the penalty factor
length(levels(county_data_sub$State))## [1] 53
ncol(X)## [1] 92
length(c(rep(0, 53), rep(1, ncol(X)-53))) # repeat 0 for each of the 53 States (not penalized), repeat 1 for other variables (penalized)## [1] 92
# run LASSO
lasso_cv <- cv.glmnet(X, Y, alpha = 1, nfolds = 10, intercept = TRUE,
penalty.factor = c(rep(0, 53), rep(1, ncol(X)-53)))
# plot LASSO results
plot(lasso_cv)lasso_cv$lambda.1se # use lambda.1se to select the smaller model## [1] 0.0145
coef.1se <- coef(lasso_cv, s = "lambda.1se") # get coefficients
coef.1se <- coef.1se[which(coef.1se != 0),] # show variables with non-zero coefficients
# output variable names
coef.1se <- rownames(as.matrix(coef.1se))[-1]
coef.1se## [1] "StateAL" "StateAR"
## [3] "StateAZ" "StateCA"
## [5] "StateCO" "StateCT"
## [7] "StateDC" "StateDE"
## [9] "StateFL" "StateGA"
## [11] "StateIA" "StateID"
## [13] "StateIL" "StateIN"
## [15] "StateKS" "StateKY"
## [17] "StateLA" "StateMA"
## [19] "StateMD" "StateME"
## [21] "StateMI" "StateMN"
## [23] "StateMO" "StateMS"
## [25] "StateMT" "StateNC"
## [27] "StateND" "StateNE"
## [29] "StateNH" "StateNJ"
## [31] "StateNM" "StateNV"
## [33] "StateNY" "StateOH"
## [35] "StateOK" "StateOR"
## [37] "StatePA" "StateRI"
## [39] "StateSC" "StateSD"
## [41] "StateTN" "StateTX"
## [43] "StateUT" "StateVA"
## [45] "StateVT" "StateWA"
## [47] "StateWI" "StateWV"
## [49] "StateWY" "Deep_Pov_All"
## [51] "PctEmpConstruction" "PctEmpMining"
## [53] "PctEmpTrade" "PctEmpAgriculture"
## [55] "PctEmpManufacturing" "PopDensity2010"
## [57] "Age65AndOlderPct2010" "TotalPop25Plus"
## [59] "Under18Pct2010" "Ed3SomeCollegePct"
## [61] "Ed5CollegePlusPct" "NetMigrationRate1019"
## [63] "NaturalChangeRate1019" "WhiteNonHispanicPct2010"
## [65] "HispanicPct2010" "Type_2015_Update"
Cp or BIC to fine tune the LASSO model from iii). Again force in State in the process.Running regsubsets with State forced in causes a crash on the computers of all group members. To still be able to account for State effects, we ran regsubsets without State, and included it in our final model as a workaroud “forcing” of State.
# let's first see what the linear model looks like based on variables selected by LASSO
coef.1se <- coef(lasso_cv, s="lambda.1se") #s=c("lambda.1se","lambda.min") or lambda value
coef.1se <- coef.1se[which(coef.1se != 0),] # get the non=zero coefficients
var.1se <- rownames(as.matrix(coef.1se))[-1] # output the variable names without intercept
var.1se <- var.1se[-grep("State", var.1se)] # remove the state levels
county_data_lasso <- county_data_sub %>%
select(c("State", "log_total_death_per100k", var.1se)) # get a subset with response and LASSO output## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(var.1se)` instead of `var.1se` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
# run relaxed LASSO
fit.1se.lm <- lm(log_total_death_per100k~., data = county_data_lasso)
summary(fit.1se.lm) ##
## Call:
## lm(formula = log_total_death_per100k ~ ., data = county_data_lasso)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.829 -0.252 0.067 0.372 3.565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45e+00 3.49e-01 12.73 < 2e-16 ***
## StateAR 5.68e-02 1.35e-01 0.42 0.67346
## StateAZ 2.47e-01 2.39e-01 1.03 0.30185
## StateCA -8.29e-01 1.58e-01 -5.23 1.8e-07 ***
## StateCO -3.50e-01 1.53e-01 -2.29 0.02188 *
## StateCT 1.26e-01 3.00e-01 0.42 0.67487
## StateDC 4.98e-01 8.06e-01 0.62 0.53707
## StateDE -5.26e-02 4.68e-01 -0.11 0.91065
## StateFL 8.02e-03 1.47e-01 0.05 0.95640
## StateGA 5.95e-02 1.17e-01 0.51 0.60956
## StateIA 2.70e-01 1.33e-01 2.02 0.04312 *
## StateID -3.05e-01 1.66e-01 -1.84 0.06648 .
## StateIL 1.88e-01 1.32e-01 1.43 0.15419
## StateIN -7.91e-02 1.34e-01 -0.59 0.55447
## StateKS -6.20e-01 1.37e-01 -4.51 6.6e-06 ***
## StateKY -7.60e-01 1.28e-01 -5.93 3.4e-09 ***
## StateLA 3.69e-01 1.43e-01 2.59 0.00975 **
## StateMA 3.81e-02 2.39e-01 0.16 0.87357
## StateMD -8.93e-02 1.94e-01 -0.46 0.64497
## StateME -1.38e+00 2.25e-01 -6.14 9.2e-10 ***
## StateMI -1.58e-01 1.35e-01 -1.16 0.24499
## StateMN -1.75e-01 1.37e-01 -1.28 0.20198
## StateMO -3.77e-01 1.29e-01 -2.93 0.00338 **
## StateMS 2.72e-01 1.32e-01 2.06 0.03909 *
## StateMT 5.49e-01 1.58e-01 3.47 0.00052 ***
## StateNC -3.70e-01 1.26e-01 -2.93 0.00344 **
## StateND 9.07e-01 1.63e-01 5.58 2.7e-08 ***
## StateNE -3.96e-01 1.43e-01 -2.78 0.00554 **
## StateNH -8.37e-01 2.73e-01 -3.07 0.00215 **
## StateNJ 2.79e-01 2.05e-01 1.36 0.17364
## StateNM -6.38e-01 1.92e-01 -3.32 0.00093 ***
## StateNV -5.99e-01 2.27e-01 -2.64 0.00841 **
## StateNY -3.88e-01 1.50e-01 -2.59 0.00969 **
## StateOH -5.40e-01 1.34e-01 -4.02 5.9e-05 ***
## StateOK -3.65e-01 1.37e-01 -2.66 0.00787 **
## StateOR -8.71e-01 1.74e-01 -5.00 6.2e-07 ***
## StatePA 1.77e-02 1.45e-01 0.12 0.90270
## StateRI -2.00e-01 3.71e-01 -0.54 0.59088
## StateSC -1.45e-02 1.53e-01 -0.10 0.92420
## StateSD 8.01e-01 1.51e-01 5.31 1.2e-07 ***
## StateTN 5.32e-02 1.30e-01 0.41 0.68323
## StateTX 1.68e-01 1.27e-01 1.32 0.18602
## StateUT -1.06e+00 1.95e-01 -5.44 5.8e-08 ***
## StateVA -4.84e-01 1.22e-01 -3.96 7.5e-05 ***
## StateVT -2.14e+00 2.38e-01 -8.99 < 2e-16 ***
## StateWA -8.69e-01 1.69e-01 -5.15 2.8e-07 ***
## StateWI -1.51e-01 1.40e-01 -1.08 0.28082
## StateWV -6.86e-01 1.56e-01 -4.41 1.1e-05 ***
## StateWY 2.05e-01 2.06e-01 0.99 0.32004
## Deep_Pov_All 7.07e-03 6.02e-03 1.17 0.24030
## PctEmpConstruction -4.83e-02 7.33e-03 -6.59 5.0e-11 ***
## PctEmpMining -1.94e-02 5.62e-03 -3.45 0.00057 ***
## PctEmpTrade 5.25e-03 6.10e-03 0.86 0.38965
## PctEmpAgriculture -3.82e-02 3.72e-03 -10.27 < 2e-16 ***
## PctEmpManufacturing 3.15e-03 3.37e-03 0.93 0.35044
## PopDensity2010 2.14e-05 9.34e-06 2.29 0.02203 *
## Age65AndOlderPct2010 3.21e-02 7.61e-03 4.22 2.6e-05 ***
## TotalPop25Plus 9.45e-08 7.74e-08 1.22 0.22269
## Under18Pct2010 5.83e-02 7.63e-03 7.64 2.9e-14 ***
## Ed3SomeCollegePct -2.14e-02 5.37e-03 -3.99 6.9e-05 ***
## Ed5CollegePlusPct -1.47e-02 2.70e-03 -5.45 5.4e-08 ***
## NetMigrationRate1019 -1.22e-02 2.56e-03 -4.77 2.0e-06 ***
## NaturalChangeRate1019 -4.49e-02 9.93e-03 -4.52 6.3e-06 ***
## WhiteNonHispanicPct2010 -2.86e-03 1.59e-03 -1.81 0.07117 .
## HispanicPct2010 8.87e-03 2.17e-03 4.09 4.4e-05 ***
## Type_2015_Update -1.67e-02 8.72e-03 -1.92 0.05497 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.79 on 3039 degrees of freedom
## Multiple R-squared: 0.359, Adjusted R-squared: 0.345
## F-statistic: 26.2 on 65 and 3039 DF, p-value: <2e-16
### TO DO: use Cp or BIC to fine-tune the LASSO model.
county_data_lasso <- data.frame(county_data_lasso) # convert from data.table to data.frame
county_data_lasso_sub <- county_data_lasso[ ,c("log_total_death_per100k", var.1se)] # extract response & LASSO output vars
county_data_lasso_fine_tune <- regsubsets(log_total_death_per100k ~., nvmax = 17, method = "exhau", county_data_lasso_sub) # get Cp and BIC
# Plot the Cp and BIC
plot(summary(county_data_lasso_fine_tune)$cp)plot(summary(county_data_lasso_fine_tune)$bic)optimal_size = 9 # 9 Seems to be a good number
final_variables <- colnames(summary(county_data_lasso_fine_tune)$which)[summary(county_data_lasso_fine_tune)$which[optimal_size, ]][-1] # extract the variables in the optimal model, minus the intercept
county_data_sub <- data.frame(county_data_sub) # convert to data.frame from data.table
final_covid_fit <- lm(log_total_death_per100k ~., data = data.frame(county_data_lasso[, c("log_total_death_per100k", final_variables, 'State')]))
summary(final_covid_fit)##
## Call:
## lm(formula = log_total_death_per100k ~ ., data = data.frame(county_data_lasso[,
## c("log_total_death_per100k", final_variables, "State")]))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.991 -0.251 0.071 0.376 3.306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.95768 0.27895 17.77 < 2e-16 ***
## PctEmpConstruction -0.04873 0.00685 -7.12 1.4e-12 ***
## PctEmpAgriculture -0.04144 0.00329 -12.61 < 2e-16 ***
## Age65AndOlderPct2010 0.05685 0.00559 10.17 < 2e-16 ***
## Under18Pct2010 0.04613 0.00663 6.96 4.2e-12 ***
## Ed3SomeCollegePct -0.02762 0.00509 -5.42 6.4e-08 ***
## Ed5CollegePlusPct -0.01641 0.00213 -7.70 1.8e-14 ***
## NetMigrationRate1019 -0.01391 0.00248 -5.60 2.3e-08 ***
## WhiteNonHispanicPct2010 -0.00586 0.00121 -4.85 1.3e-06 ***
## Type_2015_Update -0.02249 0.00853 -2.64 0.00845 **
## StateAR 0.06476 0.13488 0.48 0.63116
## StateAZ 0.27706 0.23260 1.19 0.23369
## StateCA -0.68707 0.14812 -4.64 3.7e-06 ***
## StateCO -0.30943 0.14663 -2.11 0.03491 *
## StateCT 0.18282 0.30107 0.61 0.54373
## StateDC 0.36069 0.80785 0.45 0.65528
## StateDE -0.10779 0.47098 -0.23 0.81899
## StateFL 0.01972 0.14110 0.14 0.88887
## StateGA 0.04653 0.11663 0.40 0.68996
## StateIA 0.25760 0.13131 1.96 0.04988 *
## StateID -0.31048 0.16300 -1.90 0.05690 .
## StateIL 0.20584 0.12979 1.59 0.11285
## StateIN -0.05201 0.13253 -0.39 0.69478
## StateKS -0.63547 0.13375 -4.75 2.1e-06 ***
## StateKY -0.73627 0.12621 -5.83 6.0e-09 ***
## StateLA 0.19692 0.14003 1.41 0.15976
## StateMA 0.06177 0.23892 0.26 0.79601
## StateMD -0.14589 0.19160 -0.76 0.44645
## StateME -1.33921 0.22477 -5.96 2.8e-09 ***
## StateMI -0.11054 0.13429 -0.82 0.41049
## StateMN -0.22770 0.13415 -1.70 0.08974 .
## StateMO -0.39089 0.12601 -3.10 0.00194 **
## StateMS 0.20356 0.13197 1.54 0.12306
## StateMT 0.43946 0.15511 2.83 0.00464 **
## StateNC -0.35701 0.12667 -2.82 0.00486 **
## StateND 0.65224 0.15542 4.20 2.8e-05 ***
## StateNE -0.43943 0.13895 -3.16 0.00158 **
## StateNH -0.78376 0.27333 -2.87 0.00417 **
## StateNJ 0.32943 0.20306 1.62 0.10484
## StateNM -0.50530 0.17558 -2.88 0.00403 **
## StateNV -0.71279 0.21940 -3.25 0.00117 **
## StateNY -0.33743 0.14421 -2.34 0.01936 *
## StateOH -0.52161 0.13357 -3.91 9.6e-05 ***
## StateOK -0.46516 0.13444 -3.46 0.00055 ***
## StateOR -0.80458 0.17145 -4.69 2.8e-06 ***
## StatePA -0.00349 0.14320 -0.02 0.98057
## StateRI -0.14001 0.37267 -0.38 0.70717
## StateSC -0.03072 0.15347 -0.20 0.84134
## StateSD 0.65010 0.14657 4.44 9.5e-06 ***
## StateTN 0.11310 0.13001 0.87 0.38441
## StateTX 0.18528 0.11311 1.64 0.10150
## StateUT -1.15587 0.19042 -6.07 1.4e-09 ***
## StateVA -0.51712 0.12083 -4.28 1.9e-05 ***
## StateVT -2.13762 0.23909 -8.94 < 2e-16 ***
## StateWA -0.83391 0.16582 -5.03 5.2e-07 ***
## StateWI -0.14654 0.13874 -1.06 0.29094
## StateWV -0.76369 0.15038 -5.08 4.0e-07 ***
## StateWY -0.00250 0.19852 -0.01 0.98993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.797 on 3047 degrees of freedom
## Multiple R-squared: 0.347, Adjusted R-squared: 0.335
## F-statistic: 28.4 on 57 and 3047 DF, p-value: <2e-16
Anova(final_covid_fit)## Anova Table (Type II tests)
##
## Response: log_total_death_per100k
## Sum Sq Df F value Pr(>F)
## PctEmpConstruction 32 1 50.66 1.4e-12 ***
## PctEmpAgriculture 101 1 159.08 < 2e-16 ***
## Age65AndOlderPct2010 66 1 103.47 < 2e-16 ***
## Under18Pct2010 31 1 48.44 4.2e-12 ***
## Ed3SomeCollegePct 19 1 29.40 6.4e-08 ***
## Ed5CollegePlusPct 38 1 59.33 1.8e-14 ***
## NetMigrationRate1019 20 1 31.40 2.3e-08 ***
## WhiteNonHispanicPct2010 15 1 23.48 1.3e-06 ***
## Type_2015_Update 4 1 6.94 0.0084 **
## State 480 48 15.74 < 2e-16 ***
## Residuals 1934 3047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# generating diagnostic plots
par(mfrow = c(1,2), mar = c(5,2,4,2), mgp = c(3,0.5,0)) # plot(fit3) produces several plots
plot(final_covid_fit, 1, pch = 16) # residual plot
abline(h = 0, col = "blue", lwd = 2)
plot(final_covid_fit, 2) # qqplotThe assumptions of the model are not met. This is likely due to the influence of outliers. We will identify and remove outliers from our model to achieve a better fit. We define outliers as having a residual more than two sigma away from the mean.
# identify & remove outliers
county_data_lasso$Residuals <- final_covid_fit$resid
county_data_lasso_fix <- county_data_lasso%>%
subset(abs(Residuals) < 2*sigma(final_covid_fit))#county_data_lasso_fix <- county_data_lasso%>%
# subset(log_total_death_per100k != 0)
# run model without outliers
final_covid_fit_nooutlier <- lm(log_total_death_per100k ~., data = data.frame(county_data_lasso_fix[, c("log_total_death_per100k", final_variables, 'State')]))
summary(final_covid_fit_nooutlier)##
## Call:
## lm(formula = log_total_death_per100k ~ ., data = data.frame(county_data_lasso_fix[,
## c("log_total_death_per100k", final_variables, "State")]))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8382 -0.2519 0.0392 0.2889 1.4613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.447251 0.168791 32.27 < 2e-16 ***
## PctEmpConstruction -0.036428 0.004223 -8.63 < 2e-16 ***
## PctEmpAgriculture -0.009246 0.002325 -3.98 7.2e-05 ***
## Age65AndOlderPct2010 0.043838 0.003397 12.90 < 2e-16 ***
## Under18Pct2010 0.020710 0.004017 5.16 2.7e-07 ***
## Ed3SomeCollegePct -0.021521 0.003116 -6.91 6.1e-12 ***
## Ed5CollegePlusPct -0.015556 0.001305 -11.92 < 2e-16 ***
## NetMigrationRate1019 -0.009417 0.001565 -6.02 2.0e-09 ***
## WhiteNonHispanicPct2010 -0.005322 0.000715 -7.44 1.3e-13 ***
## Type_2015_Update -0.024471 0.005051 -4.84 1.3e-06 ***
## StateAR -0.036499 0.078983 -0.46 0.64403
## StateAZ 0.253151 0.135976 1.86 0.06274 .
## StateCA -0.710891 0.088257 -8.05 1.1e-15 ***
## StateCO -0.200156 0.089302 -2.24 0.02508 *
## StateCT 0.172045 0.175698 0.98 0.32756
## StateDC 0.226135 0.470932 0.48 0.63113
## StateDE -0.158420 0.274555 -0.58 0.56398
## StateFL -0.138327 0.082813 -1.67 0.09496 .
## StateGA 0.028126 0.068547 0.41 0.68160
## StateIA 0.109206 0.077122 1.42 0.15688
## StateID -0.390694 0.097074 -4.02 5.8e-05 ***
## StateIL 0.146597 0.076234 1.92 0.05458 .
## StateIN -0.066225 0.077618 -0.85 0.39361
## StateKS -0.331273 0.083299 -3.98 7.1e-05 ***
## StateKY -0.646224 0.074695 -8.65 < 2e-16 ***
## StateLA 0.166753 0.081927 2.04 0.04190 *
## StateMA 0.308847 0.143681 2.15 0.03167 *
## StateMD -0.198367 0.111952 -1.77 0.07652 .
## StateME -1.257539 0.138418 -9.09 < 2e-16 ***
## StateMI -0.190444 0.078670 -2.42 0.01555 *
## StateMN -0.301229 0.079481 -3.79 0.00015 ***
## StateMO -0.456713 0.074057 -6.17 7.9e-10 ***
## StateMS 0.187224 0.077248 2.42 0.01542 *
## StateMT 0.010787 0.093696 0.12 0.90835
## StateNC -0.449910 0.074227 -6.06 1.5e-09 ***
## StateND 0.358870 0.093093 3.85 0.00012 ***
## StateNE -0.305483 0.083312 -3.67 0.00025 ***
## StateNH -0.875603 0.159502 -5.49 4.4e-08 ***
## StateNJ 0.361277 0.118745 3.04 0.00237 **
## StateNM -0.226628 0.105958 -2.14 0.03253 *
## StateNV -0.454156 0.138233 -3.29 0.00103 **
## StateNY -0.376259 0.084472 -4.45 8.7e-06 ***
## StateOH -0.490389 0.078408 -6.25 4.6e-10 ***
## StateOK -0.559284 0.078773 -7.10 1.6e-12 ***
## StateOR -1.010320 0.101879 -9.92 < 2e-16 ***
## StatePA -0.025382 0.083893 -0.30 0.76225
## StateRI 0.264681 0.240662 1.10 0.27151
## StateSC -0.064037 0.089763 -0.71 0.47566
## StateSD 0.362568 0.087245 4.16 3.3e-05 ***
## StateTN 0.049355 0.076127 0.65 0.51682
## StateTX 0.088226 0.066619 1.32 0.18550
## StateUT -0.749708 0.115380 -6.50 9.5e-11 ***
## StateVA -0.572648 0.070915 -8.08 9.7e-16 ***
## StateVT -1.955884 0.153985 -12.70 < 2e-16 ***
## StateWA -0.851474 0.101510 -8.39 < 2e-16 ***
## StateWI -0.284528 0.081551 -3.49 0.00049 ***
## StateWV -0.567691 0.089171 -6.37 2.2e-10 ***
## StateWY -0.219357 0.116384 -1.88 0.05956 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.464 on 2929 degrees of freedom
## Multiple R-squared: 0.52, Adjusted R-squared: 0.511
## F-statistic: 55.6 on 57 and 2929 DF, p-value: <2e-16
Anova(final_covid_fit_nooutlier)## Anova Table (Type II tests)
##
## Response: log_total_death_per100k
## Sum Sq Df F value Pr(>F)
## PctEmpConstruction 16 1 74.4 < 2e-16 ***
## PctEmpAgriculture 3 1 15.8 7.2e-05 ***
## Age65AndOlderPct2010 36 1 166.5 < 2e-16 ***
## Under18Pct2010 6 1 26.6 2.7e-07 ***
## Ed3SomeCollegePct 10 1 47.7 6.1e-12 ***
## Ed5CollegePlusPct 31 1 142.1 < 2e-16 ***
## NetMigrationRate1019 8 1 36.2 2.0e-09 ***
## WhiteNonHispanicPct2010 12 1 55.3 1.3e-13 ***
## Type_2015_Update 5 1 23.5 1.3e-06 ***
## State 314 48 30.3 < 2e-16 ***
## Residuals 631 2929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# generating diagnostic plots
par(mfrow = c(1,2), mar = c(5,2,4,2), mgp = c(3,0.5,0)) # plot(fit3) produces several plots
plot(final_covid_fit_nooutlier, 1, pch = 16) # residual plot
abline(h = 0, col = "blue", lwd = 2)
plot(final_covid_fit_nooutlier, 2) # qqplotNow that we have removed outliers, model assumptions are reasonably met.
Our model confirms that COVID death rates are increased in the elderly, indicating that this portion of the population is more vulnerable to the virus. However, our model does not explicitly/directly suggest that the death rate is higher in African Americans and Latinos. Instead, we can see that being white non-hispanic is associated with a slightly lower mortality rate. This suggests that comparatively, other minority groups fare worse in terms of mortality.
The state effect appears to be quite important, with several states being associated with higher mortality (e.g. South Dakota), and several being associated with lower mortality (e.g. Vermont). It is our recommendation that interventions should be handled on a state-by-state basis, and federally determined policies may not be beneficial across the board. Additionally, it appears that age is a major factor, with higher numbers of children and people over the age of 65 both being associated with a higher mortality. The effect of children is likely due to the fact that vaccine approval for children required more time, as well as schooling and play serving as possible transmission vectors. Of course, since individuals over the age of 65 are at higher risk of death from COVID due to immune function decay, it is not surprising that higher amounts of these individuals are associated with higher mortality. Policy interventions may focus on protecting individuals over the age of 65 (via healthcare subsidies and facility availability), and making vaccines more available among children now that approval has been cleared. Higher education is associated with reduced mortality, indicating a possible intervention route as increasing higher education in some capacity, possibly via subsidies and incentives programs. The percent of people employed in construction and agriculture are both associated with lower mortality, which may be surprising at first, but both of these trades involve being mostly outdoors and routinely require the use of face masks despite sometimes elevated levels of person-to-person contact, which may explain some of the effect. This may imply that an outdoor setting and/or wearing face masks are indeed effective at reducing transmission. The economic type of a county also seems to affect mortality, though this may be captured by the percent in construction and agriculture. Percent of white non-hispanic is associated with reduced mortality, indicating that all minority groups may be at higher risk, and interventions should be crafted around improving the conditions of minority groups. Additionally, net migration rate between 2010-2019 is associated with reduced mortality - while it is unclear what sort of policy intervention may be informed by this, it may be explained by the fact that the majority of people who migrate are young adults, who are more resilient to COVID (https://www.census.gov/content/dam/Census/library/publications/2015/acs/acs-31.pdf).
The following information may be helpful in improving out model:
softImpute.A detailed summary of the variables in each data set follows:
Infection and fatality data
Socioeconomic demographics
Income: Poverty level and household income
Jobs: Employment type, rate, and change
UnempRate2007-2019: Unemployment rate, 2007-2019
NumEmployed2007-2019: Employed, 2007-2019
NumUnemployed2007-2019: Unemployed, 2007-2019
PctEmpChange1019: Percent employment change, 2010-19
PctEmpChange1819: Percent employment change, 2018-19
PctEmpChange0719: Percent employment change, 2007-19
PctEmpChange0710: Percent employment change, 2007-10
NumCivEmployed: Civilian employed population 16 years and over, 2014-18
NumCivLaborforce2007-2019: Civilian labor force, 2007-2019
PctEmpFIRE: Percent of the civilian labor force 16 and over employed in finance and insurance, and real estate and rental and leasing, 2014-18
PctEmpConstruction: Percent of the civilian labor force 16 and over employed in construction, 2014-18
PctEmpTrans: Percent of the civilian labor force 16 and over employed in transportation, warehousing and utilities, 2014-18
PctEmpMining: Percent of the civilian labor force 16 and over employed in mining, quarrying, oil and gas extraction, 2014-18
PctEmpTrade: Percent of the civilian labor force 16 and over employed in wholesale and retail trade, 2014-18
PctEmpInformation: Percent of the civilian labor force 16 and over employed in information services, 2014-18
PctEmpAgriculture: Percent of the civilian labor force 16 and over employed in agriculture, forestry, fishing, and hunting, 2014-18
PctEmpManufacturing: Percent of the civilian labor force 16 and over employed in manufacturing, 2014-18
PctEmpServices: Percent of the civilian labor force 16 and over employed in services, 2014-18
PctEmpGovt: Percent of the civilian labor force 16 and over employed in public administration, 2014-18
People: Population size, density, education level, race, age, household size, and migration rates
PopDensity2010: Population density, 2010
LandAreaSQMiles2010: Land area in square miles, 2010
TotalHH: Total number of households, 2014-18
TotalOccHU: Total number of occupied housing units, 2014-18
AvgHHSize: Average household size, 2014-18
OwnHomeNum: Number of owner occupied housing units, 2014-18
OwnHomePct: Percent of owner occupied housing units, 2014-18
NonEnglishHHPct: Percent of non-English speaking households of total households, 2014-18
HH65PlusAlonePct: Percent of persons 65 or older living alone, 2014-18
FemaleHHPct: Percent of female headed family households of total households, 2014-18
FemaleHHNum: Number of female headed family households, 2014-18
NonEnglishHHNum: Number of non-English speaking households, 2014-18
HH65PlusAloneNum: Number of persons 65 years or older living alone, 2014-18
Age65AndOlderPct2010: Percent of population 65 or older, 2010
Age65AndOlderNum2010: Population 65 years or older, 2010
TotalPop25Plus: Total population 25 and older, 2014-18 - 5-year average
Under18Pct2010: Percent of population under age 18, 2010
Under18Num2010: Population under age 18, 2010
Ed1LessThanHSPct: Percent of persons with no high school diploma or GED, adults 25 and over, 2014-18
Ed2HSDiplomaOnlyPct: Percent of persons with a high school diploma or GED only, adults 25 and over, 2014-18
Ed3SomeCollegePct: Percent of persons with some college experience, adults 25 and over, 2014-18
Ed4AssocDegreePct: Percent of persons with an associate’s degree, adults 25 and over, 2014-18
Ed5CollegePlusPct: Percent of persons with a 4-year college degree or more, adults 25 and over, 2014-18
Ed1LessThanHSNum: No high school, adults 25 and over, 2014-18
Ed2HSDiplomaOnlyNum: High school only, adults 25 and over, 2014-18
Ed3SomeCollegeNum: Some college experience, adults 25 and over, 2014-18
Ed4AssocDegreeNum: Number of persons with an associate’s degree, adults 25 and over, 2014-18
Ed5CollegePlusNum: College degree 4-years or more, adults 25 and over, 2014-18
ForeignBornPct: Percent of total population foreign born, 2014-18
ForeignBornEuropePct: Percent of persons born in Europe, 2014-18
ForeignBornMexPct: Percent of persons born in Mexico, 2014-18
ForeignBornCentralSouthAmPct: Percent of persons born in Central or South America, 2014-18
ForeignBornAsiaPct: Percent of persons born in Asia, 2014-18
ForeignBornCaribPct: Percent of persons born in the Caribbean, 2014-18
ForeignBornAfricaPct: Percent of persons born in Africa, 2014-18
ForeignBornNum: Number of people foreign born, 2014-18
ForeignBornCentralSouthAmNum: Number of persons born in Central or South America, 2014-18
ForeignBornEuropeNum: Number of persons born in Europe, 2014-18
ForeignBornMexNum: Number of persons born in Mexico, 2014-18
ForeignBornAfricaNum: Number of persons born in Africa, 2014-18
ForeignBornAsiaNum: Number of persons born in Asia, 2014-18
ForeignBornCaribNum: Number of persons born in the Caribbean, 2014-18
Net_International_Migration_Rate_2010_2019: Net international migration rate, 2010-19
Net_International_Migration_2010_2019: Net international migration, 2010-19
Net_International_Migration_2000_2010: Net international migration, 2000-10
Immigration_Rate_2000_2010: Net international migration rate, 2000-10
NetMigrationRate0010: Net migration rate, 2000-10
NetMigrationRate1019: Net migration rate, 2010-19
NetMigrationNum0010: Net migration, 2000-10
NetMigration1019: Net Migration, 2010-19
NaturalChangeRate1019: Natural population change rate, 2010-19
NaturalChangeRate0010: Natural population change rate, 2000-10
NaturalChangeNum0010: Natural change, 2000-10
NaturalChange1019: Natural population change, 2010-19
TotalPop2010: Population size 4/1/2010 Census
TotalPopEst2010: Population size 7/1/2010
TotalPopEst2011: Population size 7/1/2011
TotalPopEst2012: Population size 7/1/2012
TotalPopEst2013: Population size 7/1/2013
TotalPopEst2014: Population size 7/1/2014
TotalPopEst2015: Population size 7/1/2015
TotalPopEst2016: Population size 7/1/2016
TotalPopEst2017: Population size 7/1/2017
TotalPopEst2018: Population size 7/1/2018
TotalPopEst2019: Population size 7/1/2019
TotalPopACS: Total population, 2014-18 - 5-year average
TotalPopEstBase2010: County Population estimate base 4/1/2010
NonHispanicAsianPopChangeRate0010: Population change rate Non-Hispanic Asian, 2000-10
PopChangeRate1819: Population change rate, 2018-19
PopChangeRate1019: Population change rate, 2010-19
PopChangeRate0010: Population change rate, 2000-10
NonHispanicNativeAmericanPopChangeRate0010: Population change rate Non-Hispanic Native American, 2000-10
HispanicPopChangeRate0010: Population change rate Hispanic, 2000-10
MultipleRacePopChangeRate0010: Population change rate multiple race, 2000-10
NonHispanicWhitePopChangeRate0010: Population change rate Non-Hispanic White, 2000-10
NonHispanicBlackPopChangeRate0010: Population change rate Non-Hispanic African American, 2000-10
MultipleRacePct2010: Percent multiple race, 2010
WhiteNonHispanicPct2010: Percent Non-Hispanic White, 2010
NativeAmericanNonHispanicPct2010: Percent Non-Hispanic Native American, 2010
BlackNonHispanicPct2010: Percent Non-Hispanic African American, 2010
AsianNonHispanicPct2010: Percent Non-Hispanic Asian, 2010
HispanicPct2010: Percent Hispanic, 2010
MultipleRaceNum2010: Population size multiple race, 2010
WhiteNonHispanicNum2010: Population size Non-Hispanic White, 2010
BlackNonHispanicNum2010: Population size Non-Hispanic African American, 2010
NativeAmericanNonHispanicNum2010: Population size Non-Hispanic Native American, 2010
AsianNonHispanicNum2010: Population size Non-Hispanic Asian, 2010
HispanicNum2010: Population size Hispanic, 2010
County classifications: Type of county (rural or urban on a rural-urban continuum scale)
Type_2015_Recreation_NO: Recreation counties, 2015 edition
Type_2015_Farming_NO: Farming-dependent counties, 2015 edition
Type_2015_Mining_NO: Mining-dependent counties, 2015 edition
Type_2015_Government_NO: Federal/State government-dependent counties, 2015 edition
Type_2015_Update: County typology economic types, 2015 edition
Type_2015_Manufacturing_NO: Manufacturing-dependent counties, 2015 edition
Type_2015_Nonspecialized_NO: Nonspecialized counties, 2015 edition
RecreationDependent2000: Nonmetro recreation-dependent, 1997-00
ManufacturingDependent2000: Manufacturing-dependent, 1998-00
FarmDependent2003: Farm-dependent, 1998-00
EconomicDependence2000: Economic dependence, 1998-00
RuralUrbanContinuumCode2003: Rural-urban continuum code, 2003
UrbanInfluenceCode2003: Urban influence code, 2003
RuralUrbanContinuumCode2013: Rural-urban continuum code, 2013
UrbanInfluenceCode2013: Urban influence code, 2013
Noncore2013: Nonmetro noncore, outside Micropolitan and Metropolitan, 2013
Micropolitan2013: Micropolitan, 2013
Nonmetro2013: Nonmetro, 2013
Metro2013: Metro, 2013
Metro_Adjacent2013: Nonmetro, adjacent to metro area, 2013
Noncore2003: Nonmetro noncore, outside Micropolitan and Metropolitan, 2003
Micropolitan2003: Micropolitan, 2003
Metro2003: Metro, 2003
Nonmetro2003: Nonmetro, 2003
NonmetroNotAdj2003: Nonmetro, nonadjacent to metro area, 2003
NonmetroAdj2003: Nonmetro, adjacent to metro area, 2003
Oil_Gas_Change: Change in the value of onshore oil and natural gas production, 2000-11
Gas_Change: Change in the value of onshore natural gas production, 2000-11
Oil_Change: Change in the value of onshore oil production, 2000-11
Hipov: High poverty counties, 2014-18
Perpov_1980_0711: Persistent poverty counties, 2015 edition
PersistentChildPoverty_1980_2011: Persistent child poverty counties, 2015 edition
PersistentChildPoverty2004: Persistent child poverty counties, 2004
PersistentPoverty2000: Persistent poverty counties, 2004
Low_Education_2015_update: Low education counties, 2015 edition
LowEducation2000: Low education, 2000
HiCreativeClass2000: Creative class, 2000
HiAmenity: High natural amenities
RetirementDestination2000: Retirement destination, 1990-00
Low_Employment_2015_update: Low employment counties, 2015 edition
Population_loss_2015_update: Population loss counties, 2015 edition
Retirement_Destination_2015_Update: Retirement destination counties, 2015 edition
The raw data sets are dirty and need transforming before we can do our EDA. It takes time and efforts to clean and merge different data sources so we provide the final output of the cleaned and merged data. The cleaning procedure is as follows. Please read through to understand what is in the cleaned data. We set eval = data_cleaned in the following cleaning chunks so that these cleaning chunks will only run if any of data/covid_county.csv, data/covid_rates.csv or data/covid_intervention.csv does not exist.
# Indicator to check whether the data files exist
data_cleaned <- !(file.exists("data/covid_county.csv") &
file.exists("data/covid_rates.csv") &
file.exists("data/covid_intervention.csv"))We first read in the table using data.table::fread(), as we did last time.
# COVID case/mortality rate data
covid_rates <- fread("data/us_counties.csv", na.strings = c("NA", "", "."))
nyc <- fread("data/nycdata.csv", na.strings = c("NA", "", "."))
# Socioeconomic data
income <- fread("data/income.csv", na.strings = c("NA", "", "."))
jobs <- fread("data/jobs.csv", na.strings = c("NA", "", "."))
people <- fread("data/people.csv", na.strings = c("NA", "", "."))
county_class <- fread("data/county_classifications.csv", na.strings = c("NA", "", "."))
# Internvention policy data
int_dates <- fread("data/intervention_dates.csv", na.strings = c("NA", "", "."))The original NYC data contains more information than we need. We extract only the number of cases and deaths and format it the same as the covid_rates data.
# NYC county fips matching table
nyc_fips <- data.table(FIPS = c('36005', '36047', '36061', '36081', '36085'),
County = c("BX", "BK", "MN", "QN", "SI"))
# nyc case
nyc_case <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
BX = BX_CASE_COUNT,
BK = BK_CASE_COUNT,
MN = MN_CASE_COUNT,
QN = QN_CASE_COUNT,
SI = SI_CASE_COUNT)]
nyc_case %<>%
pivot_longer(cols = BX:SI,
names_to = "County",
values_to = "cases") %>%
arrange(date) %>%
group_by(County) %>%
mutate(cum_cases = cumsum(cases))
# nyc death
nyc_death <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
BX = BX_DEATH_COUNT,
BK = BK_DEATH_COUNT,
MN = MN_DEATH_COUNT,
QN = QN_DEATH_COUNT,
SI = SI_DEATH_COUNT)]
nyc_death %<>%
pivot_longer(cols = BX:SI,
names_to = "County",
values_to = "deaths") %>%
arrange(date) %>%
group_by(County) %>%
mutate(cum_deaths = cumsum(deaths))
nyc_rates <- merge(nyc_case,
nyc_death,
by = c("date", "County"),
all.x= T)
nyc_rates <- merge(nyc_rates,
nyc_fips,
by = "County")
nyc_rates$State <- "New York"
nyc_rates %<>%
select(date, FIPS, County, State, cum_cases, cum_deaths) %>%
arrange(FIPS, date)We only consider cases and death in continental US. Alaska, Hawaii, and Puerto Rico have 02, 15, and 72 as their respective first 2 digits of their FIPS. We use the %/% operator for integer division to get the first 2 digits of FIPS. We also remove Virgin Islands and Northern Mariana Islands. All data of counties in NYC are aggregated as County == "New York City" in covid_rates with no FIPS, so we combine the NYC data into covid_rate.
covid_rates <- covid_rates %>%
arrange(fips, date) %>%
filter(!(fips %/% 1000 %in% c(2, 15, 72))) %>%
filter(county != "New York City") %>%
filter(!(state %in% c("Virgin Islands", "Northern Mariana Islands"))) %>%
rename(FIPS = "fips",
County = "county",
State = "state",
cum_cases = "cases",
cum_deaths = "deaths")
covid_rates$date <- as.Date(covid_rates$date)
covid_rates <- rbind(covid_rates,
nyc_rates)We set the week of Jan 21, 2020 (the first case of COVID case in US) as the first week (2020-01-19 to 2020-01-25).
covid_rates[, week := (interval("2020-01-19", date) %/% weeks(1)) + 1]Merge the TotalPopEst2019 variable from the demographic data with covid_rates by FIPS.
covid_rates <- merge(covid_rates[!is.na(FIPS)],
people[,.(FIPS = as.character(FIPS),
TotalPopEst2019)],
by = "FIPS",
all.x = TRUE)NA values in the covid_rates data set correspond to a county not having confirmed cases/deaths. We replace the NA values in these columns with zeros. FIPS for Kansas city, Missouri, Rhode Island and some others are missing. We drop them for the moment and output the data up to week 57 as covid_rates.csv.
covid_rates$cum_cases[is.na(covid_rates$cum_cases)] <- 0
covid_rates$cum_deaths[is.na(covid_rates$cum_deaths)] <- 0fwrite(covid_rates %>%
filter(week < 58) %>%
arrange(FIPS, date),
"data/covid_rates.csv")int_datesWe convert the columns representing dates in int_dates to R Date types using as.Date(). We will need to specify that the origin parameter is "0001-01-01". We output the data as covid_intervention.csv.
int_dates <- int_dates[-1,]
date_cols <- names(int_dates)[-(1:3)]
int_dates[, (date_cols) := lapply(.SD, as.Date, origin = "0001-01-01"),
.SDcols = date_cols]
fwrite(int_dates, "data/covid_intervention.csv")Merge the demographic data sets by FIPS and output as covid_county.csv.
countydata <-
merge(x = income,
y = merge(
x = people,
y = jobs,
by = c("FIPS", "State", "County")),
by = c("FIPS", "State", "County"),
all = TRUE)
countydata <-
merge(
x = countydata,
y = county_class %>% rename(FIPS = FIPStxt),
by = c("FIPS", "State", "County"),
all = TRUE
)
# Check dimensions
# They are now 3279 x 208
dim(countydata)
fwrite(countydata, "data/covid_county.csv")